Privacy-Preserving Feature Selection with Secure Multiparty Computation
Xiling Li, Rafael Dowsley and Martine De Cock
Abstract—Existing work on privacy-preserving machine learning with Secure Multiparty Computation (MPC) is almost exclusively focused on model training and on inference with trained models, thereby overlooking the important data pre-processing stage. In this work, we propose the first MPC based protocol for private feature selection based on the filter method, which is independent of model training, and can be used in combination with any MPC protocol to rank features. To this end, we propose an efficient feature scoring protocol based on Gini impurity. To demonstrate the feasibility of our approach for practical data science, we perform experiments with the proposed MPC protocols for feature selection in a commonly used machine-learning-as-a-service configuration where computations are outsourced to multiple servers, with semi-honest and with malicious adversaries. Regarding effectiveness, we show that secure feature selection with the proposed protocols improves the accuracy of classifiers on a variety of real-world data sets, without leaking information about the feature values or even which features were selected. Regarding efficiency, we document runtimes ranging from several seconds to an hour for our protocols to finish, depending on the size of the data set and the security settings.
I. INTRODUCTION
Xiling Li is with the School of Engineering and Technology, University of Washington, Tacoma, WA, USA. Email: [email protected]
Rafael Dowsley is with the Faculty of Information Technology, Monash University, Clayton, Australia. Email: [email protected]
Martine De Cock is with the School of Engineering and Technology, University of Washington, Tacoma, WA, USA and Ghent University, Ghent, Belgium. Email: [email protected]

Machine learning (ML) thrives because of the availability of an abundant amount of data, and of computational resources and devices to collect and process such data. In many effective ML applications, the data that is consumed during ML model training and inference is often of a very personal nature. Protection of user data has become a significant concern in ML model development and deployment, giving rise to laws to safeguard the privacy of users, such as the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Cryptographic protocols that allow computations on encrypted data are an increasingly important mechanism to enable data science applications while complying with privacy regulations. In this paper, we contribute to the field of privacy-preserving machine learning (PPML), a burgeoning and interdisciplinary research area at the intersection of cryptography and ML that has gained significant traction in tackling privacy issues.

In particular, we use techniques from Secure Multiparty Computation (MPC), an umbrella term for cryptographic approaches that allow two or more parties to jointly compute a specified output from their private information in a distributed fashion, without actually revealing their private information to each other [12]. We consider the scenario where different data owners or enterprises are interested in training an ML model over their combined data. There is a lot of potential in training ML models over the aggregated data from multiple enterprises. First of all, training on more data typically yields higher quality ML models. For instance, one could train a more accurate model to predict the length of hospital stay of COVID-19 patients when combining data from multiple clinics. This is an application where the data is horizontally distributed, meaning that each data owner or enterprise has records/rows of the data. Furthermore, being able to combine different data sets enables new applications that pool together data from multiple enterprises, or even from different entities within the same enterprise. An example of this would be an ML model that relies on lab test results as well as healthcare bill payment information about patients, which are usually managed by different departments within a hospital system. This is an example of an application where the data is vertically distributed, i.e. each data owner has their own columns.
While there are clear advantages to training ML models over data that is distributed across multiple data owners, often these data owners do not want to disclose their data to each other, because the data in itself constitutes a competitive advantage, or because the data owners need to comply with data privacy regulations. These roadblocks can even affect different departments within the same enterprise, such as different clinics within a healthcare system.

During the last decade, cryptographic protocols designed with MPC have been developed for training of ML models over aggregated data, without the need for the individual data owners or enterprises to reveal their data to anyone in an unencrypted manner. This existing work includes MPC protocols for training of decision tree models [26], [17], [11], [1], linear regression models [29], [15], [2], and neural network architectures [28], [3], [34], [21], [16]. Existing approaches assume that the data sets are pre-processed and clean, with features that have been pre-selected and constructed. In practical data science projects, model building constitutes only a small part of the workflow: real-world data sets must be cleaned and pre-processed, outliers must be removed, training features must be selected, and missing values need to be addressed before model training can begin. Data scientists are estimated to spend 50% to 80% of their time on data wrangling as opposed to model training itself [27]. PPML solutions will not be adopted in practice if they do not encompass these data preparation steps. Indeed, there is little point in preserving the privacy of clean data sets during model training – which is currently already possible – if the raw data has to be leaked first to arrive at those clean data sets!

Fig. 1. Overview of private feature selection and model training in 3PC setting with computing servers (parties) Alice, Bob, and Carol.

In this paper, we contribute to filling this gap in the open literature by proposing the first MPC based protocol for privacy-preserving feature selection. Feature selection is the process of selecting a subset of relevant features for model training [10]. Using a well chosen subset of features can lead to more accurate models, as well as efficiency gains during model training. A commonly used technique for feature selection is the so-called filter method, in which features are ranked according to a score indicative of their predictive ability, and subsequently the highest ranked features are retained. Despite its known shortcomings, including the fact that it considers each feature in isolation and ignores feature dependencies, the filter method is popular in practical data science because it is computationally very efficient, and independent of any specific ML model architecture.

The MPC based protocol π_FILTER-FS for private feature selection that we propose in this paper can be used in combination with any MPC protocol to rank features in a privacy-preserving manner. Well-known techniques to score features in terms of their informativeness include mutual information (MI), Gini impurity (GI), and Pearson's correlation coefficient (PCC). We propose an efficient feature scoring protocol π_MS-GINI based on Gini impurity, leaving the development of privacy-preserving protocols for other feature scoring techniques as future work. The computation of a GI score for continuous valued features traditionally requires sorting of the feature values to determine candidate split points in the feature value range. As sorting is an expensive operation to perform in a privacy-preserving way, we instead propose a "mean-split Gini score" (MS-GINI) that avoids the need for sorting by selecting the mean of the feature values as the split point. As we show in Sec. V, feature selection with MS-GINI leads to accuracy improvements that are on par with those obtained with GI, PCC, and MI in the data sets used in our experiments. Depending on the application and the data set at hand, one may want to use a different feature scoring technique, in combination with our protocol π_FILTER-FS for private feature selection.

Fig. 1 illustrates the flow of private feature selection and subsequent model training at a high level in an outsourced "ML as a service" setting with three computing servers, nicknamed Alice, Bob, and Carol (three-party computation, 3PC). 3PC with honest majority, i.e. with at most one server being corrupted, is a configuration that is often used in MPC because this setup allows for some of the most efficient MPC schemes. In Step 1 of Fig. 1, each of m data owners sends secret shares of their data to the three servers (parties). While the secret shared data can be trivially revealed by combining shares, no information about the data is revealed by the shares received by any single server, meaning that none of the servers by themselves learn anything about the actual values of the data. In Step 2A, the three servers execute protocols π_MS-GINI and π_FILTER-FS to create a reduced version of the data set that contains only the selected features. Throughout this process, none of the parties learns the values of the data or even which features are selected, as all computations are done over secret shares. Next, in Step 2B, the parties train an ML model over the pre-processed data using existing privacy-preserving training protocols, e.g., a privacy-preserving protocol for logistic regression training [16]. Finally, in Step 3, the servers can disclose the trained model to the intended model owner by revealing their shares. Steps 1 and 3 are trivial as they follow directly from the choice of the underlying MPC scheme (see Sec. II-B). MPC protocols for Step 2B have previously been proposed. The focus of this paper is on Step 2A. Our approach works in scenarios where the data is horizontally partitioned (each data owner has one or more of the rows or instances), scenarios where the data is vertically partitioned (each data owner has some of the columns or attributes), or any other partition.

After presenting preliminaries about Gini impurity and MPC in Sec. II, and discussing related work in Sec. III, we present our main protocol π_FILTER-FS for private feature selection and the supporting protocols π_GINI-FS and π_MS-GINI in Sec. IV. In Sec. V we demonstrate the feasibility of our approach for practical data science in terms of accuracy and runtime results through experiments executed on real-world data sets. In our experiments, we consider honest-majority 3PC settings with semi-honest as well as malicious adversaries. While parties corrupted by semi-honest adversaries follow the protocol instructions correctly but try to obtain additional information, parties corrupted by malicious adversaries can deviate from the protocol instructions. Defending against the latter comes at a higher computational cost which, as we show, can be mitigated by using a recently proposed MPC scheme for 4PC.

II. PRELIMINARIES
A. Feature Selection based on Gini Impurity
Assume that we have a set S of m training examples, where each training example consists of an input feature vector (x_1, ..., x_p) and a corresponding label y. Throughout this paper, we assume that there are n possible class labels. We wish to induce an ML model from this training data that can infer, for a previously unseen input feature vector, a label y as accurately as possible. Not all p features may be equally beneficial to this end. In the filter approach to feature selection, all features are first assigned a score that is indicative of their predictive ability. Subsequently only the best scoring features are retained. A well-known feature scoring criterion is Gini impurity, made popular as part of the classification and regression tree algorithm (CART) [7].

If the j-th feature F_j is a discrete feature that can assume ℓ different values, then it induces a partition S_1 ∪ S_2 ∪ ... ∪ S_ℓ of S in which S_i is the set of instances that have the i-th value for the j-th feature. The Gini impurity of S_i is defined as:

G(S_i) = \sum_{c=1}^{n} p_c · (1 − p_c) = 1 − \sum_{c=1}^{n} p_c^2    (1)

where p_c is the probability of a randomly selected instance from S_i belonging to the c-th class. The Gini score of feature F_j is a weighted average of the Gini impurities of the S_i's:

G(F_j) = \sum_{i=1}^{ℓ} (|S_i| / m) · G(S_i)    (2)

Conceptually, G(F_j) estimates the likelihood of a randomly selected instance being misclassified based on knowledge of the value of the j-th feature. During feature selection, the k features with the lowest Gini scores are retained.

If F_j is a feature with continuous values, then G(F_j) is defined as the weighted average of the Gini impurities of a set S_{≤θ} containing all instances for which the j-th feature value is smaller than or equal to θ, and a set S_{>θ} with all instances for which the j-th feature value is larger than θ.
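For concreteness, Equations (1) and (2) for a discrete feature can be computed in plaintext as follows (a minimal Python sketch; the function names and the toy data are our own):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels (Eq. 1)."""
    m = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((cnt / m) ** 2 for cnt in counts.values())

def gini_score_discrete(feature_values, labels):
    """Weighted average of the Gini impurities over the partition induced
    by a discrete feature (Eq. 2); lower scores indicate better features."""
    m = len(labels)
    partition = {}
    for x, y in zip(feature_values, labels):
        partition.setdefault(x, []).append(y)
    return sum(len(part) / m * gini_impurity(part)
               for part in partition.values())

# A feature that perfectly separates the two classes scores 0.
print(gini_score_discrete(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 0.0
```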
In the CART algorithm, an optimal threshold θ is determined based on sorting of all the instances on their feature values. Since privacy-preserving sorting is a time-consuming operation in MPC [6], [20], in Sec. IV-B we propose a more straightforward approach for threshold selection which, as we show in Sec. V, yields desirable improvements in accuracy.

B. Secure Multiparty Computation
Protocols for MPC enable a set of parties to jointly compute the output of a function over each of the parties' private inputs, without requiring the parties to disclose their inputs to anyone. MPC is concerned with the protocol execution coming under attack by an adversary which may corrupt parties to learn private information or cause the result of the computation to be incorrect. MPC protocols are designed to prevent such attacks from being successful, and use proven cryptographic techniques to guarantee privacy.
Adversarial Model:
An adversary A can corrupt any number of parties. In a dishonest-majority setting, half or more of the parties may be corrupt, while in an honest-majority setting, more than half of the parties are honest (not corrupted). Furthermore, A can be a semi-honest or a malicious adversary. While a party corrupted by a semi-honest or "passive" adversary follows the protocol instructions correctly but tries to obtain additional information, parties corrupted by malicious or "active" adversaries can deviate from the protocol instructions.

The protocols in Sec. IV are sufficiently generic to be used in dishonest-majority as well as honest-majority settings, with passive or active adversaries. This is achieved by changing the underlying MPC scheme to align with the desired security setting. Some of the most efficient MPC schemes have been developed for 3 parties, out of which at most one is corrupted. We evaluate the runtime of our protocols in this honest-majority 3PC setting, which is growing in popularity in the PPML literature, e.g. [14], [24], [31], [34], and we demonstrate how even better runtimes can be obtained with a recently proposed MPC scheme for 4PC with one corruption [13].

In the MPC schemes used in this paper, all computations by the parties (servers) are done over integers in a ring Z_q. Raw data in ML applications is often real-valued. As is common in the MPC literature, we convert real numbers to integers using a fixed-point representation [9]. After this conversion, the data owners secret share their values with the parties using a secret sharing scheme and proceed by performing operations over the secret shares.

For the passive 3PC setting, we follow a replicated secret sharing scheme from Araki et al. [4]. To share a secret value x ∈ Z_q among parties P_1, P_2 and P_3, the shares x_1, x_2, x_3 are chosen uniformly at random in Z_q with the constraint that x_1 + x_2 + x_3 = x mod q. P_1 receives x_1 and x_2, P_2 receives x_2 and x_3, and P_3 receives x_3 and x_1.
Note that it is necessary to combine the shares available to two parties in order to recover x, and no information about the secret shared value x is revealed to any single party. For short, we denote this secret sharing by [[x]]_q. Let [[x]]_q, [[y]]_q be secret shared values and c be a constant; the following computations can be done locally by the parties without communication:

• Addition (z = x + y): Each party P_i gets shares of z by computing z_i = x_i + y_i and z_{(i+1 mod 3)} = x_{(i+1 mod 3)} + y_{(i+1 mod 3)}. This is denoted by [[z]]_q ← [[x]]_q + [[y]]_q.
• Subtraction [[z]]_q ← [[x]]_q − [[y]]_q is performed analogously.
• Multiplication by a constant (z = c · x): Each party multiplies its local shares of x by c to obtain shares of z. This is denoted by [[z]]_q ← c · [[x]]_q.
• Addition of a constant (z = x + c): P_1 and P_3 add c to their share x_1 of x to obtain z_1, while the parties set z_2 = x_2 and z_3 = x_3. This will be denoted by [[z]]_q ← [[x]]_q + c.

The main advantage of replicated secret sharing compared to other secret sharing schemes is that replicated shares enable a very efficient procedure for multiplying secret shared values. To compute x · y = (x_1 + x_2 + x_3)(y_1 + y_2 + y_3), the parties locally perform the following computations: P_1 computes z_1 = x_1 · y_1 + x_1 · y_2 + x_2 · y_1, P_2 computes z_2 = x_2 · y_2 + x_2 · y_3 + x_3 · y_2, and P_3 computes z_3 = x_3 · y_3 + x_3 · y_1 + x_1 · y_3. By doing so, without any interaction, each P_i obtains z_i such that z_1 + z_2 + z_3 = x · y mod q. After that, the parties are required to convert from this additive secret sharing representation back to the original replicated secret sharing representation (which requires that the parties add a secret sharing of zero and that each party sends one share to one other party, for a total communication of three shares). See [4] for more details.

In the active 3PC setting, we use the MPC scheme SY-Replicated2k recently proposed by Dalskov et al. [13].
In this MPC scheme, the parties are prevented from deviating from the protocol and from gaining knowledge from other parties through the use of information-theoretic message authentication codes (MACs). In addition to computations over secret shares of the data, the parties also perform computations required for the MACs. See [13] for details. Finally, we use the MPC scheme recently proposed by Dalskov et al. [13] for the active 4PC setting, where the computations are outsourced to four servers out of which at most one has been corrupted by a malicious adversary.
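As a plaintext simulation (not an actual MPC implementation: communication, secure randomness, MACs, and fixed-point encoding are all omitted), the replicated sharing and the local addition and multiplication steps described above can be sketched as:

```python
import random

Q = 2**32  # ring modulus for Z_q (illustrative choice)

def share(x):
    """Replicated secret sharing: party P_i holds the pair (x_i, x_{i+1})."""
    x1, x2 = random.randrange(Q), random.randrange(Q)
    x3 = (x - x1 - x2) % Q
    return [(x1, x2), (x2, x3), (x3, x1)]  # views of P1, P2, P3

def reconstruct(shares):
    """Any two parties together hold all three additive shares of x."""
    (x1, _), (x2, _), (x3, _) = shares
    return (x1 + x2 + x3) % Q

def add(xs, ys):
    """Local addition: each party adds its two shares component-wise."""
    return [((a + c) % Q, (b + d) % Q) for (a, b), (c, d) in zip(xs, ys)]

def mul_local(xs, ys):
    """Local step of multiplication: P_i computes z_i = x_i*y_i + x_i*y_{i+1}
    + x_{i+1}*y_i. The z_i form an *additive* sharing of x*y; the resharing
    step (which needs communication) is omitted here."""
    return [(a * c + a * d + b * c) % Q for (a, b), (c, d) in zip(xs, ys)]

xs, ys = share(6), share(7)
assert reconstruct(add(xs, ys)) == 13
assert sum(mul_local(xs, ys)) % Q == 42
```

Summing the three z_i indeed collects all nine cross terms x_i · y_j exactly once, which is why no interaction is needed for the local step.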
Building Blocks:
Building on the cryptographic primitives listed above for addition and multiplication of secret shared values, MPC protocols for other operations have been developed in the literature. In this paper, we use:

• Secure matrix multiplication π_DMM: at the start of this protocol, the parties have secret sharings [[A]] and [[B]] of matrices A and B; at the end, the parties have a secret sharing [[C]] of the product of the matrices, C = A × B. π_DMM can be constructed as a direct extension of the secure multiplication protocol for two integers, which we will denote as π_DM in the remainder of the paper. Similarly, we use π_DP to denote the protocol for the secure dot product of two vectors. In a replicated sharing scheme, dot products can be computed more efficiently than by the direct extension from π_DM, and matrix multiplication can use this optimized version of dot products; we refer to Keller [23] for details.
• Secure comparison protocol π_LT [8]: at the start of this protocol, the parties have secret sharings [[x]] and [[y]] of two integers x and y; at the end, they have a secret sharing of 1 if x < y, and a secret sharing of 0 otherwise.
• Secure argmin protocol π_ARGMIN: this protocol accepts secret sharings of a vector of integers and returns a secret sharing of the index at which the vector has the minimum value. π_ARGMIN is straightforwardly constructed using the above mentioned secure comparison protocol.
• Secure equality test protocol π_EQ [9]: at the start of this protocol, the parties have secret sharings [[x]] and [[y]] of two integers x and y; at the end, they have a secret sharing of 1 if x = y, and a secret sharing of 0 otherwise.
• Secure division protocol π_DIV [9]: at the start of this protocol, the parties have secret sharings [[x]]_q and [[y]]_q of two integers x and y; at the end, they have a secret sharing [[z]]_q of z = x/y.

III. RELATED WORK
Private Feature Selection:
Given that feature selection is an important step in the data preparation pipeline, it has received remarkably little attention in the PPML literature to date. Feature selection techniques have been proposed that favor features that do not contain sensitive information [22]. Work like that is orthogonal to ours, as it assumes the existence of a data curator with full access to all the data. Regarding approaches to private feature selection among multiple data owners, early attempts [5], [32] in the semi-honest setting use a "distributed secure sum protocol" reminiscent of the way in which sums are computed in MPC based on secret sharing (see Sec. II-B). The limitations of this work in terms of security include the fact that the parties find out which features are selected, and statistical information about the data is leaked to all parties during the computation of the feature scores, as only summations, and not other operations, are done in a secure manner. [30] proposed a more principled 2PC protocol with Paillier homomorphic encryption for private feature selection with χ² as filter criterion in the semi-honest setting, without an experimental evaluation of the proposed approach. To the best of our knowledge, private feature selection with malicious adversaries has not yet been proposed or evaluated. The recent approach by [35] is not based on cryptography, does not provide any formal privacy guarantees, and leaks information through disclosure of intermediate representations.

Secure Gini Score Computation:
Besides as a technique to score features for feature selection, as we do in this paper, Gini impurity is traditionally used in ML in the CART algorithm for training decision trees [7], and it has been adopted in MPC protocols for privacy-preserving training of decision tree models [17], [11], [1]. Gini score computation for continuous valued features, as we do in this paper, is especially challenging from an MPC point of view, as it requires sorting of feature values to determine candidate split points in the feature range. Abspoel et al. [1] put ample effort into performing this sorting process as efficiently as possible in a secure manner. We take a drastically different approach by assuming that the mean of the feature values serves as a good approximation for an optimal split threshold. This has the double advantage that (1) there is no need for oblivious sorting of feature values, and (2) for each feature only one Gini score for one threshold θ has to be computed, as opposed to computing the Gini score for multiple candidate thresholds and then selecting the best one through secure comparisons. This leads to significant efficiency gains, while preserving good accuracy, as we demonstrate in Sec. V.

Protocol 1
Protocol π_FILTER-FS for Secure Filter based Feature Selection
Input: A secret shared m × p data matrix [[D]]_q, a secret shared p-length score vector [[G]]_q, the number k < p of features to be selected, and a constant t that is bigger than the highest possible score in [[G]]_q
Output: a secret shared m × k matrix [[D']]_q
1: for i ← 1 to k do
2:   [[I[i]]]_q ← π_ARGMIN([[G]]_q)
3:   for j ← 1 to p do
4:     [[flag_k]]_q ← π_EQ([[I[i]]]_q, j)
5:     [[T[j][i]]]_q ← [[flag_k]]_q
6:     [[G[j]]]_q ← [[G[j]]]_q + π_DM([[flag_k]]_q, t − [[G[j]]]_q)
7:   end for
8: end for
9: [[D']]_q ← π_DMM([[D]]_q, [[T]]_q)
10: return [[D']]_q

IV. METHODOLOGY
We present a protocol for oblivious feature selection based on precomputed scores for the features, followed by a protocol for computing the feature scores themselves in a private manner. In Sec. V we evaluate the protocols in 3PC and 4PC honest-majority settings.
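In plaintext terms, the selection step of Protocol 1 amounts to building a one-hot transformation matrix from the k smallest scores and multiplying it with the data. The following Python sketch is our own illustrative code: it operates on cleartext values, whereas the protocol performs every step over secret shares.

```python
def filter_select(D, G, k, t):
    """Select the k columns of D with the lowest scores in G by building
    a p x k one-hot matrix T and computing D' = D x T (cf. Protocol 1).
    t must exceed every score in G; selected scores are overwritten with t
    so they are not picked again."""
    m, p = len(D), len(D[0])
    G = list(G)  # work on a copy
    T = [[0] * k for _ in range(p)]
    for i in range(k):
        idx = min(range(p), key=lambda j: G[j])   # plaintext "argmin"
        for j in range(p):
            flag = 1 if idx == j else 0           # plaintext "equality test"
            T[j][i] = flag
            G[j] = G[j] + flag * (t - G[j])       # oblivious-style overwrite
    # D' = D x T keeps exactly the selected columns, in selection order
    return [[sum(D[r][j] * T[j][i] for j in range(p)) for i in range(k)]
            for r in range(m)]

D = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
G = [65, 26, 90, 14]   # hypothetical scores; 14 and 26 are the two lowest
print(filter_select(D, G, 2, 100))  # [[4, 2], [8, 6]]
```

In Protocol 1, the plaintext argmin, equality test, conditional overwrite, and matrix product are replaced by their secure counterparts π_ARGMIN, π_EQ, π_DM, and π_DMM over secret shares.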
A. Secure Filter based Feature Selection
At the start of the Protocol π_FILTER-FS for secure feature selection, the parties have secret shares of a data matrix D of size m × p, in which the rows correspond to instances and the columns to features. The parties also have secret shares of a vector G of length p containing a score for each of the features. At the end of the protocol, the parties have a reduced matrix D' of size m × k in which only the columns from D corresponding to the lowest scores in G are retained (note that this protocol can be trivially modified to select the k features with the highest scores). The main ideas behind the protocol (which is described in Protocol 1) are to:

1) Determine the indices of the features that need to be selected (these are stored in a secret-shared way in I).
2) Create a matrix T in which the columns are one-hot-encoded representations of these indices.
3) Multiply D with this feature selection matrix T.

Before walking through the pseudocode of Protocol 1, we present a plaintext example to illustrate the notation.

Example 1.
Consider the data matrix D at the left of Equation (3), containing values for m = 5 instances (rows) and p = 4 features (columns). Assume that the feature score vector is G = [65, 26, …, 14], and that we want to select the k = 2 features with the lowest scores in G.

D · T = D'    (3)

The lowest scores in G are 14 and 26, hence the 4th and the 2nd column of D should be selected. The columns of T in Equation (3) are a one-hot-encoding of 4 and 2 respectively, and multiplying D with T will yield the desired reduced data matrix D'. This multiplication takes place on Line 9 in Protocol 1. The bulk of Protocol 1 is about how to construct T based on G. As explained below, this process involves an auxiliary vector, which, at the end of the protocol, contains the following values for our example: I = [4, 2].

In the protocol, vector [[I]]_q of length k stores the indices of the k selected features out of the p features of [[D]]_q, and matrix [[T]]_q is a p × k transformation matrix that eventually holds one-hot-encodings of the indices in I. Through executing Lines 1-8 of Protocol 1, the parties construct a feature selection matrix T based on the values in G. On Line 2 the index of the i-th smallest value in [[G]]_q is identified. To this end, the parties run a secure argmin protocol π_ARGMIN. The inner for-loop serves two purposes, namely constructing the i-th column of matrix T, and overwriting the score in G of the feature that was selected on Line 2 by the upper bound, so that it will not be selected anymore in further iterations of the outer for-loop (such an upper bound t is passed as input to Protocol 1 and is usually very easy to determine in practice, as most common feature scoring techniques range between 0 and 1):

• To construct the i-th column of T, the parties loop through rows j = 1 . . . p, and on Line 5, update T[j][i] with either a 0 or a 1, depending on the outcome of the secure equality test on Line 4. The outcome of this test will be 1 exactly once, namely when j equals I[i], hence Line 5 results in a one-hot-encoding of I[i] stored in the i-th column of T.
• The flag flag_k computed on Line 4 is used again on Line 6 to overwrite G[I[i]] with t in an oblivious manner, where t is a value that is larger than the highest possible score that occurs in [[G]]_q. This theoretical upper bound t ensures that feature I[i] will not be selected again in later iterations of the outer for-loop.

As is common in MPC protocols, we use multiplication instead of control flow logic for conditional assignments. To this end, a condition-based branch operation such as "if c then a ← b" is rephrased as a ← a + c · (b − a). In this way, the number and the kind of operations executed by the parties does not depend on the actual values of the inputs, so it does not leak information that could be exploited by side-channel attacks. Such a conditional assignment occurs on Line 6 of Protocol 1, where the value of the condition c itself is computed on Line 4.

In the final step, on Line 9, the parties multiply matrix D with matrix T in a secure manner to obtain a matrix D' that contains only the feature columns corresponding to the k best features. Throughout this process, the parties are unaware of which features were actually selected. The secret shared matrix D' can subsequently be used as input for a privacy-preserving ML model training protocol, e.g. [16].

B. Secure Feature Score Computation
Protocol π_FILTER-FS assumes the availability of a feature score vector G and an upper bound t for the values in G. Below we explain how this can be obtained from the data in a secure manner. To this end, we present a protocol π_MS-GINI for computation of the score of a feature based on Gini impurity. This protocol is applicable to data sets with continuous features. It is computationally cheaper than previously proposed protocols for Gini impurity that rely on sorting of feature values. Furthermore, as shown in previous work [25] and in Sec. V, the "Mean-Split" Gini score can yield similar accuracy improvements.

Recall that we have a set S of m training examples, where each training example consists of an input feature vector (x_1, ..., x_p) and a corresponding label y. We propose to split the set of values of the j-th feature F_j based on its mean value as a threshold θ. We denote by S_{≤θ} the set of instances that have x_j ≤ θ, and by S_{>θ} the set of instances that have x_j > θ. Furthermore, for c = 1, ..., n, we denote by L_c the set of examples from S that have class label y = c. Based on the binary split, we define the MS-GINI ("Mean-Split" Gini) score for feature F_j as:

G(F_j) = (1/m) · (|S_{≤θ}| · G(S_{≤θ}) + |S_{>θ}| · G(S_{>θ}))    (4)

with the Gini impurities of S_{≤θ} and S_{>θ} defined as:

G(S_{≤θ}) = 1 − \sum_{c=1}^{n} (p_c^{≤θ})^2 ;  G(S_{>θ}) = 1 − \sum_{c=1}^{n} (p_c^{>θ})^2    (5)

and the probabilities defined as:

p_c^{≤θ} = |S_{≤θ} ∩ L_c| / |S_{≤θ}| ;  p_c^{>θ} = |S_{>θ} ∩ L_c| / |S_{>θ}|    (6)

Formulas (4), (5) and (6) are consistent with the definition of the Gini score given in Sec. II, and are presented here in more detail to enhance the readability of our secure protocol π_MS-GINI for the computation of the Gini score G(F) of feature F (described in Protocol 2).

At the start of Protocol π_MS-GINI, the parties have secret shares of a feature column F (think of this as a column from data matrix D in Example 1), as well as secret shares of a one-hot-encoded version of the label vector. The latter is represented

Protocol 2
Protocol π MS − GINI for Secure MS-GINI Score ofa Feature
Input:
A secret shared feature column [[ F ]] q = ( [[ f ]] q , [[ f ]] q ,..., [[ f m ]] q ), asecret shared m × ( n − label-class matrix [[ L ]] q , where m is the numberof instances and n is the number of classes. Output:
MS-GINI score [[G(F)]]_q of the feature F

1: [[θ]]_q ← ([[f_1]]_q + [[f_2]]_q + ... + [[f_m]]_q) · m^{-1}
2: Initialize [[a]]_q, [[b]]_q, [[A]]_q and [[B]]_q with zeros.
3: for i ← 1 to m do
4:     [[flag_s]]_q ← π_LT([[θ]]_q, [[f_i]]_q)
5:     [[b]]_q ← [[b]]_q + [[flag_s]]_q
6:     for j ← 1 to n − 1 do
7:         [[flag_m]]_q ← π_DM([[flag_s]]_q, [[L[i][j]]]_q)
8:         [[B[j]]]_q ← [[B[j]]]_q + [[flag_m]]_q
9:         [[A[j]]]_q ← [[A[j]]]_q + [[L[i][j]]]_q − [[flag_m]]_q
10:    end for
11: end for
12: [[a]]_q ← m − [[b]]_q
13: [[A[n]]]_q ← [[a]]_q − ([[A[1]]]_q + ... + [[A[n−1]]]_q)
14: [[B[n]]]_q ← [[b]]_q − ([[B[1]]]_q + ... + [[B[n−1]]]_q)
15: [[G(S≤θ)]]_q ← [[a]]_q − π_DM(π_DP([[A]]_q, [[A]]_q), π_DIV(1, [[a]]_q))
16: [[G(S>θ)]]_q ← [[b]]_q − π_DM(π_DP([[B]]_q, [[B]]_q), π_DIV(1, [[b]]_q))
17: [[G(F)]]_q ← [[G(S≤θ)]]_q + [[G(S>θ)]]_q
18: return [[G(F)]]_q

as a label-class matrix [[L]]_q, in which [[L[i][j]]]_q = [[1]]_q means that the label of the i-th instance is equal to the j-th class, and [[L[i][j]]]_q = [[0]]_q otherwise. We note that, while there are n classes, it is sufficient for L to contain only n − 1 columns: as there is exactly one value 1 per row, the value of the n-th column is implicit from the values of the other columns. We indirectly take advantage of this fact by terminating the loop on Lines 6-10 at n − 1, and performing the calculations for the n-th class separately and in a cheaper manner on Lines 13-14, as we explain in more detail below.

On Line 1, the parties compute [[θ]]_q, the threshold used to split the input feature [[F]]_q, as the mean of the feature values in the column. To this end, each party first sums up its secret shares of the feature values, and then multiplies the sum with the known constant m^{-1} locally. Line 2 initializes all counters related to S≤θ and S>θ to zero. After Line 14, these counters contain the following values:

a = |S≤θ|
b = |S>θ|
A[j] = |S≤θ ∩ L_j|, for j = 1 ... n
B[j] = |S>θ ∩ L_j|, for j = 1 ... n

These counters are needed for the probabilities in Equation (6). For each instance, on Line 4 of Protocol 2, the parties perform a secure comparison to determine whether the instance belongs to S>θ. The outcome of that test is added to b on Line 5. Since the total number of instances is m, a can be straightforwardly computed as m − b after the outer for-loop, i.e. on Line 12. Lines 7-8 check whether the instance belongs to S>θ ∩ L_j, in which case B[j] is incremented by 1. The equivalent operation for A[j] would be [[A[j]]]_q ← [[A[j]]]_q + π_DM(1 − [[flag_s]]_q, [[L[i][j]]]_q). We have simplified this instruction on Line 9, taking advantage of the fact that π_DM([[flag_s]]_q, [[L[i][j]]]_q) has already been computed as [[flag_m]]_q on Line 7.

On Lines 13-14 the parties compute [[A[n]]]_q and [[B[n]]]_q, leveraging the fact that the sum of all values in [[A]]_q is [[a]]_q, and the sum of all values in [[B]]_q is [[b]]_q. All operations on Lines 13-14 can be performed locally by the parties, on their own shares. Moving the computation of [[A[n]]]_q and [[B[n]]]_q out of the for-loop reduces the number of secure multiplications needed from m × n to m × (n − 1). In the case of a binary classification problem, i.e. n = 2, this means that the number of secure multiplications required is cut in half.

Using the notation for the counters from the pseudocode of Protocol 2, Equation (4) comes down to:

G(F) = (1/m) · [ a · (1 − Σ_{j=1}^{n} (A[j]/a)^2) + b · (1 − Σ_{j=1}^{n} (B[j]/b)^2) ]
     = (1/m) · [ (a − (1/a) · (A • A)) + (b − (1/b) · (B • B)) ]

in which A • A and B • B are the dot products of A and B with themselves, respectively. These computations are performed by the parties on Lines 15-17 using, among other things, the protocol π_DP for the secure dot product of vectors and the protocol π_DIV for secure division.
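To make the counter-based computation concrete, the score can be sketched in the clear as a short Python function. This is an illustrative plaintext reference, not the MPC protocol (in π_MS-GINI all of these values remain secret shared), and, like the protocol, it omits the final 1/m factor:

```python
import numpy as np

def ms_gini(f, labels, n_classes):
    """Plaintext sketch of the MS-GINI score (illustrative, not the MPC protocol).

    f        : 1-D array of m feature values
    labels   : 1-D array of m class indices in {0, ..., n_classes - 1}
    Mirrors the counters a, b, A[j], B[j] of Protocol 2 and, like the
    protocol, omits the 1/m factor, which does not change the feature ranking.
    """
    theta = f.mean()                      # Line 1: mean-split threshold
    above = f > theta                     # flag_s per instance
    b = int(above.sum())                  # b = |S_{>theta}|
    a = len(f) - b                        # a = m - b
    # Per-class counts on each side of the split
    A = np.array([np.sum(~above & (labels == j)) for j in range(n_classes)])
    B = np.array([np.sum(above & (labels == j)) for j in range(n_classes)])
    # a*(1 - sum (A[j]/a)^2) = a - (A.A)/a, and analogously for b
    g_le = a - (A @ A) / a if a > 0 else 0.0
    g_gt = b - (B @ B) / b if b > 0 else 0.0
    return g_le + g_gt
```

A feature that separates the classes perfectly around its mean scores 0, and higher scores indicate more class mixing on the two sides of the split.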
We note that the final multiplication with the factor 1/m is omitted altogether from Protocol 2, as this has no effect on the relative ordering of the scores of the individual features.

If the data are vertically partitioned and all data owners hold the label vector, they can compute the MS-GINI scores offline without π_MS-GINI, and the computing servers would only have to perform feature selection based on the pre-computed MS-GINI scores with Protocol π_FILTER-FS. In practice, however, it is often not acceptable for every data owner to have access to all labels, so we do not assume this scenario in our protocols.

C. Secure Feature Selection with MS-GINI
Protocol π_GINI-FS (described in Protocol 3) performs secure filter-based feature selection with MS-GINI, and is the protocol used for the experiments in this work. It combines the building blocks presented earlier in this section. By executing the loop on Lines 1-3, the parties compute the MS-GINI score of the i-th feature from the original data matrix [[D]]_q using Protocol π_MS-GINI, and store it in [[G[i]]]_q. On Line 4, the parties perform filter-based feature selection using Protocol π_FILTER-FS to obtain an m × k matrix [[D′]]_q with the k selected features from [[D]]_q. As the standard GINI score is upper bounded by 1, and π_MS-GINI omits the multiplication by 1/m for efficiency reasons, it is safe to use m as the upper bound that is passed to Protocol π_FILTER-FS on Line 4.
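In the clear, the filter step amounts to ranking the p features by score and keeping the k with the lowest impurity, as in the plaintext sketch below; π_FILTER-FS performs the same selection obliviously, without revealing the scores or which columns were kept:

```python
import numpy as np

def filter_select(D, scores, k):
    """Plaintext analogue of filter-based selection: keep the k columns
    of D with the lowest impurity scores (lower = better for Gini-style
    scores). The MPC protocol does this without revealing scores or
    the identity of the selected columns.
    """
    best = np.argsort(scores)[:k]     # indices of the k lowest scores
    return D[:, np.sort(best)]        # preserve the original column order
```

Preserving the original column order is a presentation choice for the sketch; any fixed ordering of the selected columns would do.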
Protocol 3 Protocol π_GINI-FS for Secure Filter-based Feature Selection with MS-GINI

Input: a secret shared m × p data matrix [[D]]_q = ([[F_1]]_q, [[F_2]]_q, ..., [[F_p]]_q), and a secret shared m × (n − 1) label-class matrix [[L]]_q, where m is the number of instances, p the number of features, n the number of classes, and k the number of features to be selected.
Output: a secret shared m × k matrix [[D′]]_q

1: for i ← 1 to p do
2:     [[G[i]]]_q ← π_MS-GINI([[F_i]]_q, [[L]]_q, m, n)
3: end for
4: [[D′]]_q ← π_FILTER-FS([[D]]_q, [[G]]_q, k, m)
5: return [[D′]]_q

V. EXPERIMENTS AND RESULTS
The first four columns of Table I contain details for three data sets corresponding to binary classification tasks with continuous-valued input features: Cognitive Load Detection (CogLoad) [19], Lee Silverman Voice Treatment (LSVT) [33], and Speed Dating (SPEED) [18], along with the number of instances m, raw features p, selected features k, and folds for cross-validation (CV). The middle five columns of Table I contain accuracy results, averaged over the CV folds, for logistic regression (LR) models trained on the RAW data sets with all p features, and on reduced data sets with only the top k features selected with a variety of scoring techniques, namely MS-GINI (as proposed in this paper), traditional Gini impurity (GI), the Pearson correlation coefficient (PCC), and mutual information (MI). Feature selection with all of these techniques was performed according to the filter approach, i.e. independently of the fact that the selected features were subsequently used to train an LR model. As the results show, feature selection based on MS-GINI is on par with the other methods, and substantially improves the accuracy compared to model training on the RAW data sets.

TABLE I: FEATURE SELECTION ACCURACY AND RUNTIME RESULTS
(columns: data set details — Data set, m, p, k; logistic regression accuracy results; runtime)

TABLE II: RUNTIME DETAILS FOR ACTIVE 3PC

Data set | m     | p   | k   | Prot 1    | Prot 1, Ln 9 | Prot 2
CogLoad  | 632   | 120 | 12  | 27 sec    | 23 sec       | 1.13 sec
LSVT     | 126   | 310 | 103 | 152 sec   | 53 sec       | 0.33 sec
SPEED    | 8,378 | 122 | 67  | 1,837 sec | 1,812 sec    | 14.73 sec

The last three columns of Table I contain runtime results for protocol π_GINI-FS for secure filter-based feature selection with MS-GINI (see Protocol 3). To obtain these results, we implemented π_GINI-FS along with the supporting protocols π_MS-GINI and π_FILTER-FS in MP-SPDZ [23]. All benchmark tests were run on 3 or 4 co-located F32s V2 Azure virtual machines. Each VM contains 32 cores, 64 GiB of memory, and up to 14 Gbps of network bandwidth between the virtual machines. The runtime results are for semi-honest ("passive") and malicious ("active") adversary models (see Sec. II-B) in a 3PC or 4PC honest-majority setting over a ring Z_q with q = 2^64. Each of the parties ran on a separate machine, which means that the results in Table I cover communication time in addition to computation time.
As with the accuracies, the runtimes reported in Table I are an average across the folds. The relative differences between the passive 3PC, active 3PC, and active 4PC settings are in line with known findings from the MPC literature, in particular the fact that completing private feature selection in the active setting takes substantially longer than in the passive setting; this increase in runtime is a price one has to pay for security and correctness in case the parties cannot be trusted to follow the protocol instructions.

https://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation

For further insight into the dominating factors in the runtime cost, Table II presents more fine-grained runtime results for the active 3PC setting. Protocol 2, which is executed once per feature, grows in runtime with the number of instances m. While the nested for-loop on Lines 1-8 in Protocol 1 depends on k and p only, the matrix multiplication on Line 9 in Protocol 1 depends on all of m, p, and k, and contributes substantially to the runtime. The increase in runtime for the SPEED vs. the CogLoad data set, for example, which have almost the same number of original features p, is due both to the increase in m (which affects Line 9 in Protocol 1, and Lines 3-11 in Protocol 2) and the increase in k (which affects Lines 1-8 of Protocol 1).

VI. CONCLUSION AND FUTURE WORK
Data preprocessing, an important part of the ML model development pipeline, has been largely overlooked in the PPML literature to date. In this paper, we have proposed an MPC protocol for privacy-preserving selection of the top k features of a data set, and we have demonstrated its feasibility in practice through an experimental evaluation. Our protocol is based on the filter approach to feature selection, which means that it is independent of any specific ML model architecture. Furthermore, it can be used in combination with any feature scoring technique; in this paper, we have proposed an efficient MPC protocol based on Gini impurity to this end.

In addition to MPC protocols for other feature selection techniques, MPC protocols for many more tasks related to the data preprocessing phase still need to be developed, including privacy-preserving hyperparameter search to determine the best value of k for the number of features to be selected, as well as protocols for dealing with outliers and missing values. While these may be perceived as less exciting tasks of the end-to-end ML pipeline, they are crucial to enable PPML applications in practical data science.

REFERENCES

[1] Mark Abspoel, Daniel Escudero, and Nikolaj Volgushev. Secure training of decision trees with continuous attributes. In
Proceedings on Privacy Enhancing Technologies (PoPETs), pages 167–187, 2021.
[2] Anisha Agarwal, Rafael Dowsley, Nicholas D McKinney, Dongrui Wu, Chin Teng Lin, Martine De Cock, and Anderson Nascimento. Protecting privacy of users in brain-computer interface applications. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 27(8):1546–1555, 2019.
[3] Nitin Agrawal, Ali Shahin Shamsabadi, Matt J Kusner, and Adrià Gascón. QUOTIENT: two-party secure neural network training and prediction. In ACM SIGSAC Conference on Computer and Communications Security, pages 1231–1247, 2019.
[4] Toshinori Araki, Jun Furukawa, Yehuda Lindell, Ariel Nof, and Kazuma Ohara. High-throughput semi-honest secure three-party computation with an honest majority. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 805–817, 2016.
[5] Madhushri Banerjee and Sumit Chakravarty. Privacy preserving feature selection for distributed data using virtual dimension. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 2281–2284, 2011.
[6] Dan Bogdanov, Sven Laur, and Riivo Talviste. Oblivious sorting of secret-shared data. Technical Report, 2013.
[7] Leo Breiman, Jerome Friedman, Charles Stone, and Richard Olshen. Classification and Regression Trees. Taylor and Francis, 1st edition, 1984.
[8] O. Catrina and S. De Hoogh. Improved primitives for secure multiparty integer computation. In International Conference on Security and Cryptography for Networks, pages 182–199. Springer, 2010.
[9] O. Catrina and A. Saxena. Secure computation with fixed-point numbers. In , volume 6052 of Lecture Notes in Computer Science, pages 35–50. Springer, 2010.
[10] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
[11] C.A. Choudhary, M. De Cock, R. Dowsley, A. Nascimento, and D. Railsback. Secure training of extra trees classifiers over continuous data. In AAAI-20 Workshop on Privacy-Preserving Artificial Intelligence, 2020.
[12] Ronald Cramer, Ivan Bjerre Damgard, and Jesper Buus Nielsen. Secure Multiparty Computation and Secret Sharing. Cambridge University Press, 1st edition, 2015.
[13] A. Dalskov, D. Escudero, and M. Keller. Fantastic four: Honest-majority four-party secure computation with malicious security. Cryptology ePrint Archive, Report 2020/1330, 2020.
[14] A. Dalskov, D. Escudero, and M. Keller. Secure evaluation of quantized neural networks. Proceedings on Privacy Enhancing Technologies, 2020(4):355–375, 2020.
[15] Martine De Cock, Rafael Dowsley, Anderson C. A. Nascimento, and Stacey C. Newman. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data. In , pages 3–14, 2015.
[16] Martine De Cock, Rafael Dowsley, Anderson C. A. Nascimento, Davis Railsback, Jianwei Shen, and Ariel Todoki. High performance logistic regression for privacy-preserving genome analysis. BMC Medical Genomics, 14(1):23, 2021.
[17] Sebastiaan De Hoogh, Berry Schoenmakers, Ping Chen, and Harm op den Akker. Practical secure decision tree learning in a teletreatment application. In International Conference on Financial Cryptography and Data Security, pages 179–194. Springer, 2014.
[18] Raymond Fisman, Sheena S. Iyengar, Emir Kamenica, and Itamar Simonson. Gender differences in mate selection: Evidence from a speed dating experiment. The Quarterly Journal of Economics, 121(2):673–697, 2006.
[19] Martin Gjoreski, Tine Kolenik, Timotej Knez, Mitja Luštrek, Matjaž Gams, Hristijan Gjoreski, and Veljko Pejović. Datasets for cognitive load inference using wearable sensors and psychological traits. Applied Sciences, 10(11):38–43, 2020.
[20] M. Goodrich. Zig-zag sort: A simple deterministic data-oblivious sorting algorithm running in O(n log n) time. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 684–693, 2014.
[21] Chuan Guo, Awni Hannun, Brian Knott, Laurens van der Maaten, Mark Tygert, and Ruiyu Zhu. Secure multiparty computations in floating-point arithmetic. arXiv preprint arXiv:2001.03192, 2020.
[22] Yasser Jafer, Stan Matwin, and Marina Sokolova. A framework for a privacy-aware feature selection evaluation measure. In , pages 62–69. IEEE, 2015.
[23] Marcel Keller. MP-SPDZ: A versatile framework for multi-party computation. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 1575–1590, 2020.
[24] N. Kumar, M. Rathee, N. Chandran, D. Gupta, A. Rastogi, and R. Sharma. CrypTFlow: Secure TensorFlow inference. In , 2020.
[25] Xiling Li and Martine De Cock. Cognitive load detection from wrist-band sensors. In Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, pages 456–461, 2020.
[26] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Annual International Cryptology Conference, pages 36–54. Springer, 2000.
[27] Steven Lohr. For big-data scientists, 'janitor work' is key hurdle to insights. The New York Times, 2014.
[28] P. Mohassel and Y. Zhang. SecureML: A system for scalable privacy-preserving machine learning. In IEEE Symposium on Security and Privacy (SP), pages 19–38, 2017.
[29] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft. Privacy-preserving ridge regression on hundreds of millions of records. In IEEE Symposium on Security and Privacy (SP), pages 334–348, 2013.
[30] Vanishree Rao, Yunhui Long, Hoda Eldardiry, Shantanu Rane, Ryan A. Rossi, and Frank Torres. Secure two-party feature selection. arXiv preprint arXiv:1901.00832, 2019.
[31] M.S. Riazi, C. Weinert, O. Tkachenko, E.M. Songhori, T. Schneider, and F. Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Asia Conference on Computer and Communications Security, pages 707–721, 2018.
[32] Mina Sheikhalishahi and Fabio Martinelli. Privacy-utility feature selection as a privacy mechanism in collaborative data classification. In IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), pages 244–249, 2017.
[33] Athanasios Tsanas, Max A. Little, Cynthia Fox, and Lorraine O. Ramig. Objective automatic assessment of rehabilitative speech treatment in Parkinson's disease. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 22(1):181–190, 2014.
[34] Sameer Wagh, Divya Gupta, and Nishanth Chandran. SecureNN: 3-party secure computation for neural network training. Proceedings on Privacy Enhancing Technologies (PoPETs), 2019(3):26–49, 2019.
[35] Xiucai Ye, Hongmin Li, Akira Imakura, and Tetsuya Sakurai. Distributed collaborative feature selection based on intermediate representation. In