Multinomial Random Forest: Toward Consistency and Privacy-Preservation
Yiming Li, Jiawang Bai, Jiawei Li, Xue Yang, Yong Jiang, Chun Li, Shutao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University, China
PCL Research Center of Networks and Communications, Peng Cheng Laboratory, China
Institute for Computational Biology, Case Western Reserve University, Ohio, USA
[email protected]; [email protected]

Abstract
Despite the impressive performance of random forests (RF), their theoretical properties have not been thoroughly understood. In this paper, we propose a novel RF framework, dubbed multinomial random forest (MRF), to analyze its consistency and privacy-preservation. Instead of a deterministic greedy split rule or one with only simple randomness, the MRF adopts two impurity-based multinomial distributions to randomly select a split feature and a split value, respectively. Theoretically, we prove the consistency of the proposed MRF and analyze its privacy-preservation within the framework of differential privacy. We also demonstrate on multiple datasets that its performance is on par with the standard RF. To the best of our knowledge, MRF is the first consistent RF variant whose performance is comparable to that of the standard RF.
Random forest (RF) [1] is a popular type of ensemble learning method. Because of its excellent performance and fast yet efficient training process, the standard RF and several of its variants have been widely used in many fields, such as computer vision [2, 3] and data mining [4, 5]. However, due to the inherent bootstrap randomization and the highly greedy, data-dependent construction process, it is very difficult to analyze the theoretical properties of random forests [6], especially consistency. Since consistency ensures that the model converges to the optimum given a sufficient amount of data, this property is especially critical in the big data era.

To address this issue, several RF variants [7, 8, 9, 6, 10, 11] were proposed. Unfortunately, all existing consistent RF variants suffer from relatively poor performance compared with the standard RF due to two mechanisms introduced for ensuring consistency. On the one hand, the data partition process allows only half of the training samples to be used for the construction of the tree structure, which significantly reduces the performance of consistent RF variants. On the other hand, extra randomness (e.g., a Poisson or Bernoulli distribution) is introduced, which further hinders the performance. Accordingly, the mechanisms introduced for theoretical analysis make it difficult to eliminate the performance gap between consistent RF variants and the standard RF.

Is this gap really impossible to fill? In this paper, we propose a novel consistent RF framework, dubbed multinomial random forest (MRF), by introducing the randomness more reasonably, as shown in Figure 1. In the MRF, two impurity-based multinomial distributions are used as the basis for randomly selecting a split feature and a specific split value, respectively. Accordingly, the "best" split point has the highest probability of being chosen, while other candidate split points that are nearly as good as the "best" one also have a good chance of being selected. This randomized splitting process is more reasonable and makes up for the accuracy drop with almost no extra computational complexity. Besides, the introduced impurity-based randomness is essentially an exponential mechanism satisfying differential privacy, and the randomized prediction of each tree proposed in this paper also adopts the exponential mechanism. Accordingly, we can also analyze the privacy-preservation of MRF under the differential privacy framework. To the best of our knowledge, this privacy-preservation property, which is important since the training data may well contain sensitive information, has never been analyzed for previous consistent RF variants.

The main contributions of this work are three-fold: (1) we propose a novel multinomial-based method to improve the greedy split process of decision trees; (2) we propose a new random forest variant, dubbed multinomial random forest (MRF), based on which we analyze its consistency and privacy-preservation; (3) extensive experiments demonstrate that the performance of MRF is on par with Breiman's original RF and better than that of all existing consistent RF variants. MRF is the first consistent RF variant whose performance is comparable to that of the standard RF.

* Equal contribution. Preprint. Correspondence to: Xue Yang and Shutao Xia.

Figure 1: Splitting criteria of different RFs. The standard RF always chooses the split point with the highest impurity decrease. Denil14 and BRF also mostly choose the split point in a greedy way, while holding a small or even negligible probability of selecting another point randomly. The selection probability in MRF is positively related to the impurity decrease. All randomness in existing consistent RF variants aims only to fulfill the consistency, whereas the randomness in MRF is more reasonable.

Random forest [1] is a distinguished ensemble learning algorithm inspired by the random subspace method [12] and random split selection [13]. The standard decision trees are built upon bootstrap datasets and split with the CART methodology [14]. Its various variants, such as quantile regression forests [15] and deep forests [16], were proposed and used in a wide range of applications [4, 2, 3] for their effective training process and great interpretability. Despite the widespread use of random forests in practice, theoretical analysis of their success has not yet been fully established. Breiman [1] showed the first theoretical result, indicating that the generalization error is bounded by the performance of individual trees and the diversity of the whole forest. After that, the relationship between random forests and a type of nearest neighbor-based estimator was also studied [17].

One of the important properties, consistency, has yet to be established for random forests. Consistency ensures that the result of RF converges to the optimum as the sample size increases, and it was first discussed by Breiman [7]. As an important milestone, Biau [8] proved the consistency of two directly simplified random forests. Subsequently, several consistent RF variants were proposed for various purposes, for example, random survival forests [18], an online version of random forests [19], and generalized regression forests [20]. Recently, Haghiri [21] proposed CompRF, whose split process relies on triplet comparisons rather than the information gain. To ensure consistency, [6] suggested that an independent dataset is needed to fit the leaves. This approach is called data partition. Under this framework, Denil [10] developed a consistent RF variant (called
Denil14 in this paper), narrowing the gap between theory and practice. Following Denil14, Wang [11] introduced Bernoulli random forests (BRF), which reached state-of-the-art performance. A comparison of different consistent RFs is included in the Appendix.

Although several consistent RF variants have been proposed, due to their relatively poor performance compared with the standard RF, how to close the gap between theoretical consistency and practical performance is still an important open problem.
In addition to the exploration of consistency, some schemes [22, 23] were also presented to address privacy concerns. Among those schemes, differential privacy [24], as a new and promising privacy-preservation model, has been widely adopted in recent years. In what follows, we outline the basics of differential privacy.

Let D = {(X_i, Y_i)}_{i=1}^n denote a dataset consisting of n i.i.d. observations, where X_i ∈ R^D represents D-dimensional features and Y_i ∈ {1, ..., K} indicates the label. Let A = {A_1, A_2, ..., A_D} represent the feature set. The formal definition of differential privacy is as follows.

Definition 1 (ε-Differential Privacy). A randomized mechanism M gives ε-differential privacy for every set of outputs O and any neighboring datasets D and D' differing in one record, if M satisfies
\[ \Pr[\mathcal{M}(\mathcal{D}) \in O] \le \exp(\epsilon) \cdot \Pr[\mathcal{M}(\mathcal{D}') \in O], \quad (1) \]
where ε denotes the privacy budget that restricts the privacy guarantee level of M. A smaller ε represents a stronger privacy level.

Currently, two basic mechanisms are widely used to guarantee differential privacy: the Laplace mechanism [25] and the exponential mechanism [26]. The former is suitable for numeric queries and the latter for non-numeric queries. Since the MRF mainly involves selection operations, we adopt the exponential mechanism to preserve privacy.

Definition 2 (Exponential Mechanism). Let q: (D, o) → R be a score function of dataset D that measures the quality of output o ∈ O. The exponential mechanism M(D) satisfies ε-differential privacy if it outputs o with probability proportional to exp(εq(D, o)/(2Δq)), i.e.,
\[ \Pr[\mathcal{M}(\mathcal{D}) = o] = \frac{\exp\!\left(\frac{\epsilon\, q(\mathcal{D},o)}{2\Delta q}\right)}{\sum_{o' \in O} \exp\!\left(\frac{\epsilon\, q(\mathcal{D},o')}{2\Delta q}\right)}, \quad (2) \]
where Δq is the sensitivity of the quality function, defined as
\[ \Delta q = \max_{\forall o, \mathcal{D}, \mathcal{D}'} |q(\mathcal{D}, o) - q(\mathcal{D}', o)|. \quad (3) \]

Compared to the standard RF, the MRF replaces the bootstrap technique with a partition of the training set, which is necessary for consistency, as suggested in [6]. Specifically, to build a tree, the training set D is divided randomly into two non-overlapping subsets D^S and D^E, which play different roles. One subset, D^S, is used to build the structure of a tree; we call the observations in this subset the structure points. Once a tree is built, the labels of its leaves are re-determined on the basis of the other subset D^E; we call the observations in this second subset the estimation points. The ratio of the two subsets is parameterized by the partition rate = |structure points| / |estimation points|. To build another tree, the training set is re-partitioned randomly and independently.

The construction of a tree relies on a recursive partitioning algorithm. Specifically, to split a node, we introduce two impurity-based multinomial distributions: one for split feature selection and another for split value selection. A specific split point is a pair of a split feature and a split value.
In the classification problem, the impurity decrease at a node u caused by a split point v is defined as
\[ I(\mathcal{D}^S_u, v) = T(\mathcal{D}^S_u) - \frac{|\mathcal{D}^{S_l}_u|}{|\mathcal{D}^S_u|}\, T(\mathcal{D}^{S_l}_u) - \frac{|\mathcal{D}^{S_r}_u|}{|\mathcal{D}^S_u|}\, T(\mathcal{D}^{S_r}_u), \quad (4) \]
where D^S_u is the subset of D^S at node u; D^{S_l}_u and D^{S_r}_u, generated by splitting D^S_u with v, are the two subsets in the left and right child of node u, respectively; and T(·) is the impurity criterion (e.g., Shannon entropy or the Gini index). Unless otherwise specified, we omit the subscript u of each symbol and use I to denote I(D^S_u, v) for shorthand in the rest of this paper.

Let V = {v_{ij}} denote the set of all possible split points for the node, where v_{ij} is the i-th value of the j-th feature, and let I_{i,j} be the corresponding impurity decrease. In what follows, we first introduce the feature selection mechanism for a node, and then describe the split value selection mechanism corresponding to the selected feature.

M(φ)-based split feature selection. At first, we obtain a vector I = (I_1, ..., I_D) = (max_i{I_{i,1}}, ..., max_i{I_{i,D}}) based on each I_{i,j}, where max_i{I_{i,j}}, j = 1, ..., D, is the largest possible impurity decrease of feature A_j. Then, the following three steps are performed:
• Normalize I: Î = ((I_1 − min I)/(max I − min I), ..., (I_D − min I)/(max I − min I));
• Compute the probabilities φ = (φ_1, ..., φ_D) = softmax(B_1 Î), where B_1 ≥ 0 is a hyper-parameter related to the privacy budget;
• Randomly select a feature according to the multinomial distribution M(φ).

M(ϕ)-based split value selection. After selecting the feature A_j for a node, we need to determine the corresponding split value to construct the two children. Suppose A_j has m possible split values; we then perform the following steps:
• Normalize I^(j) = (I_{1,j}, ..., I_{m,j}) as Î^(j) = ((I_{1,j} − min I^(j))/(max I^(j) − min I^(j)), ..., (I_{m,j} − min I^(j))/(max I^(j) − min I^(j))), where j identifies the feature A_j;
• Compute the probabilities ϕ = (ϕ_1, ..., ϕ_m) = softmax(B_2 Î^(j)), where B_2 ≥ 0 is another hyper-parameter related to the privacy budget;
• Randomly select a split value according to the multinomial distribution M(ϕ).

We repeat the above process to split nodes until the stopping criterion is met. The stopping criterion relates to the minimum leaf size k: the number of estimation points is required to be at least k for every leaf. The pseudo code of the training process is shown in the Appendix, and a short illustrative code sketch of the two selection steps is given at the end of this section.

Once a tree h has been grown based on D^S, we re-determine the predicted values of its leaves according to D^E. Similar to [1], given an unlabeled sample x, we can easily know which leaf of h it falls into, and the empirical probability that sample x has label c (c ∈ {1, ..., K}) is estimated as
\[ \eta^{(c)}(\mathbf{x}) = \frac{1}{|\mathcal{N}^E_h(\mathbf{x})|} \sum_{(X, Y) \in \mathcal{N}^E_h(\mathbf{x})} \mathbb{I}\{Y = c\}, \quad (5) \]
where N^E_h(x) is the set of estimation points in the leaf containing x and I(·) is the indicator function. In contrast to the standard RF and existing consistent RF variants, the predicted label h(x) of x is randomly selected with probability proportional to exp(B_3 η^(c)(x)/2), where B_3 ≥ 0 is also related to the privacy budget. The final prediction of the MRF is the majority vote over all trees, which is the same as the one used in [1]:
\[ \hat{y} = h^{(M)}(\mathbf{x}) = \arg\max_c \sum_i \mathbb{I}\{h^{(i)}(\mathbf{x}) = c\}. \quad (6) \]
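To make the selection and prediction mechanisms above concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; the function names, the small numerical guard, and the toy inputs are assumptions) samples a split feature via M(φ), a split value via M(ϕ), and a randomized leaf label as in Eq. (5)-(6).

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def select_split(impurity, B1, B2, rng):
    """impurity[i, j]: impurity decrease of the i-th candidate value of feature j."""
    # Feature selection: use each feature's best achievable impurity decrease.
    I = impurity.max(axis=0)
    I_hat = (I - I.min()) / (I.max() - I.min() + 1e-12)       # normalize to [0, 1]
    j = rng.choice(len(I_hat), p=softmax(B1 * I_hat))          # M(phi)
    # Split value selection for the chosen feature j.
    Ij = impurity[:, j]
    Ij_hat = (Ij - Ij.min()) / (Ij.max() - Ij.min() + 1e-12)
    i = rng.choice(len(Ij_hat), p=softmax(B2 * Ij_hat))        # M(varphi)
    return j, i

def predict_leaf(eta, B3, rng):
    """eta[c]: empirical class probabilities of the leaf (Eq. (5)); returns a random label."""
    p = np.exp(B3 * eta / 2.0)
    return rng.choice(len(eta), p=p / p.sum())

rng = np.random.default_rng(0)
impurity = rng.random((5, 3))          # 5 candidate values x 3 features (toy numbers)
print(select_split(impurity, B1=10.0, B2=10.0, rng=rng))
print(predict_leaf(np.array([0.7, 0.2, 0.1]), B3=10.0, rng=rng))
```

With large B_1, B_2, B_3 the sampling concentrates on the greedy choice; with values near zero it becomes uniform, which matches the role of these hyper-parameters described above.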
In this section, we theoretically analyze the consistency and privacy-preservation of the MRF. All omitted proofs are given in the
Appendix.

4.1 Consistency

4.1.1 Preliminaries

Definition 3.
Given the dataset D, for a certain distribution of (X, Y), a sequence of classifiers {h} is consistent if the error probability L satisfies E(L) = Pr(h(X, Z, D) ≠ Y) → L*, where L* denotes the Bayes risk and Z denotes the randomness involved in the construction of the tree, such as the selection of candidate features.

Lemma 1.
The voting classifier h^(M), which takes the majority vote over M copies of h with different randomizing variables, is consistent if those classifiers {h} are consistent.

Lemma 2.
Consider a partitioning classification rule that builds its prediction by a majority vote in each leaf node. If the labels of the voting data have no effect on the structure of the classification rule, then E[L] → L* as n → ∞, provided that
1. the diameter of N(X) → 0 as n → ∞ in probability, and
2. |N^E(X)| → ∞ as n → ∞ in probability,
where N(X) is the leaf containing X and |N^E(X)| is the number of estimation points in N(X).

Lemma 1 [8] states that the consistency of individual trees leads to the consistency of the forest. Lemma 2 [27] implies that the consistency of a tree is ensured if, as n → ∞, every hypercube at a leaf is sufficiently small but still contains an infinite number of estimation points.

In general, the proof of consistency has three main steps: (1) each feature has a non-zero probability of being selected, (2) each split reduces the expected size of the split feature, and (3) the split process can go on indefinitely. We first propose two lemmas for steps (1) and (2), respectively, and then the consistency theorem of the MRF.
Lemma 3.
In the MRF, the probability that any given feature A is selected for splitting at each node has a lower bound P_1 > 0.

Lemma 4.
Suppose that all features are supported on [0, 1]. In the MRF, once a split feature A is selected, if this feature is divided into N (N ≥ 3) equal partitions A^(1), ..., A^(N) from small to large (i.e., A^(i) = [(i−1)/N, i/N]), then for any split point v there exists P_2 > 0 such that
\[ \Pr\Big(v \in \bigcup_{i=2}^{N-1} A^{(i)} \,\Big|\, A\Big) \ge P_2. \]

Lemma 3 states that the MRF fulfills the first aforementioned requirement. Lemma 4 states that the second condition is also met, by showing that the selected split value has a large probability of not being near the two endpoints of the feature interval.
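The lower and upper bounds behind Lemma 3 (proved in the Appendix) can be checked numerically; the following small sketch (our own illustration, with arbitrary D, B_1, and random normalized impurity vectors) verifies that every softmax probability stays within 1/(1 + (D−1)e^{B_1}) and e^{B_1}/(e^{B_1} + D − 1).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, B1 = 8, 10.0
lo = 1.0 / (1.0 + (D - 1) * np.exp(B1))        # Lemma 3 lower bound P_1
hi = np.exp(B1) / (np.exp(B1) + D - 1)         # corresponding upper bound
for _ in range(10_000):
    I_hat = rng.random(D)                      # any normalized impurity vector in [0, 1]^D
    p = softmax(B1 * I_hat)
    assert np.all(p >= lo - 1e-12) and np.all(p <= hi + 1e-12)
print(f"bounds hold: {lo:.3e} <= p <= {hi:.3f}")
```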
Theorem 1.
Suppose that X is supported on [0, 1]^D and has a non-zero density almost everywhere, and that the cumulative distribution function of the split points is right-continuous at 0 and left-continuous at 1. If B_3 → ∞, the MRF is consistent when k → ∞ and k/n → 0 as n → ∞.

In this part, we prove that the MRF satisfies ε-differential privacy based on two composition properties [28]. Suppose we have a set of privacy mechanisms M = {M_1, ..., M_p} and each M_i provides an ε_i privacy guarantee; then the sequential composition and parallel composition are described as follows:

Property 1 (Sequential Composition). Suppose M = {M_1, ..., M_p} are sequentially performed on a dataset D; then M provides (Σ_{i=1}^p ε_i)-differential privacy.

Property 2 (Parallel Composition). Suppose M = {M_1, ..., M_p} are performed on disjoint subsets of the entire dataset, i.e., {D_1, ..., D_p}, respectively; then M provides (max{ε_i}_{i=1}^p)-differential privacy.
[Table 1: Test accuracy (%) of Denil14, BRF, CompRF-C, MRF, CompRF-I, and Breiman's RF on the 22 UCI datasets (ZOO, HAYES, ECHO, HEPATITIS, WDBC, TRANSFUSION, VEHICLE, MAMMO, MESSIDOR, WEBSITE, BANKNOTE, CMC, YEAST, CAR, IMAGE, CHESS, ADS, WILT, WINE-QUALITY, PHISHING, NURSERY, CONNECT-4), together with the average rank of each method; † / • mark datasets on which MRF / RF is significantly better under Wilcoxon's signed-rank test.]
Lemma 5.
The impurity-based multinomial distribution M(φ) for feature selection is essentially the exponential mechanism of differential privacy, and it satisfies B_1-differential privacy.

Lemma 6.
The impurity-based multinomial distribution M(ϕ) for split value selection is essentially the exponential mechanism of differential privacy, and it satisfies B_2-differential privacy.

Lemma 7.
The label selection of each leaf in a tree satisfies B_3-differential privacy.

Based on the above properties and lemmas, we can obtain the following theorem:
Theorem 2.
The proposed MRF satisfies ε-differential privacy when the hyper-parameters B_1, B_2, and B_3 satisfy B_1 + B_2 = ε/(d·t) and B_3 = ε/t, where t is the number of trees and d is the depth of a tree such that d ≤ O(|D^E|/k).

Datasets. We conduct experiments on twenty-two UCI datasets used in previous consistent RF works [10, 11, 21]. A detailed description of the used datasets is shown in the Appendix.

Baselines. We select
Denil14 [10], BRF [11], and CompRF [21] as the baseline methods in the following evaluations. These methods are state-of-the-art consistent random forest variants. Specifically, we evaluate two different CompRF variants proposed in [21]: consistent CompRF (CompRF-C) and inconsistent CompRF (CompRF-I). Besides, we provide the results of the standard RF (Breiman) [1] as another important baseline for comparison.
Training Setup. We carry out 10 times 10-fold cross-validation to generate 100 forests for each method. All forests have t = 100 trees and minimum leaf size k = 5. The Gini index is used as the impurity measure except for CompRF. In Denil14, BRF, CompRF, and RF, we set the size of the set of candidate features to √D. The partition rate of all consistent RF variants, and all other settings stated above, follow those used in [10, 11]. In MRF, we set B_1 = B_2 = 10 and B_3 → ∞ on all datasets, and the hyper-parameters of the baseline methods are set according to their papers.

Results. Table 1 shows the average test accuracy. Among the four consistent RF variants, the one with the highest accuracy is indicated in boldface. In addition, we carry out Wilcoxon's signed-rank test [29] to test for the difference between the results of the MRF and the standard RF at significance level 0.05. Datasets for which the MRF is significantly better than the standard RF are marked with "†"; conversely, those for which RF is significantly better are marked with "•". Moreover, the last line shows the average rank of the different methods across all datasets.

As shown in Table 1, MRF significantly exceeds all existing consistent RF variants. For example, MRF achieves a clear improvement over the current state-of-the-art method in most cases. Besides, the performance of the MRF even surpasses Breiman's original random forest on twelve of the datasets, and the advantage of the MRF is statistically significant on ten of them. To the best of our knowledge, this has never been achieved by any other consistent random forest method. Note that we have not fine-tuned hyper-parameters such as B_1, B_2, and t. The performance of the MRF might be further improved by tuning these parameters, at the cost of additional computational complexity.

Figure 2: Visualization results of the proposed MRF. (a): Aerial image; (b): Ground truth; (c): Heat map of the prediction. A pixel is predicted as within the building area if and only if its color is red in the heat map.

Figure 3: The pixel-wise accuracy (PA) and the intersection over union (IoU) of different methods. The standard deviation is indicated by the error bar.

We treat the segmentation as a pixel-wise classification and build the dataset based on aerial images (https://github.com/dgriffiths3/ml_segmentation). Each pixel of these images is labeled with one of two semantic classes: building or not building. In addition to the RGB values of each pixel, we also construct some other widely used features. Specifically, we adopt the local binary pattern [30] with radius 24 to characterize texture, and calculate eight Haralick features [31] (including angular second moment, contrast, correlation, entropy, homogeneity, mean, variance, and standard deviation). We sample 10,000 pixels without replacement for training and test the performance on the test image. To reduce the effect of randomness, we repeat the experiments 5 times with different training set assignments. Besides, all settings are the same as those of Section 5.1 unless otherwise specified. The per-pixel feature construction is sketched below.

Figure 4: Accuracy (%) of the MRF under different hyper-parameter values on ECHO, CMC, ADS, WDBC, CAR, and CONNECT-4.
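As a rough illustration of the per-pixel features described above, the sketch below builds a feature matrix from RGB values and a local binary pattern map. Everything beyond the LBP radius of 24 is our own assumption (the number of LBP sampling points, the window size, and the use of simple local mean/standard deviation columns as dependency-free stand-ins for the eight Haralick statistics), so this is not the paper's exact pipeline.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def pixel_features(rgb, radius=24, n_points=8, win=5):
    """Return an (H*W, F) feature matrix: RGB + LBP code + local mean/std of intensity.
    The local mean/std columns are simple stand-ins for the Haralick features used in the paper."""
    h, w, _ = rgb.shape
    gray = rgb.mean(axis=2)
    lbp = local_binary_pattern(gray, n_points, radius, method="uniform")
    pad = win // 2
    padded = np.pad(gray, pad, mode="reflect")
    # Local mean / std over a win x win window via a sliding view (no explicit loops).
    windows = np.lib.stride_tricks.sliding_window_view(padded, (win, win))
    loc_mean = windows.mean(axis=(-1, -2))
    loc_std = windows.std(axis=(-1, -2))
    return np.concatenate(
        [rgb.reshape(h * w, 3),
         lbp.reshape(h * w, 1),
         loc_mean.reshape(h * w, 1),
         loc_std.reshape(h * w, 1)], axis=1)

img = np.random.rand(128, 128, 3)   # toy image in place of a real aerial tile
X = pixel_features(img)
print(X.shape)                      # (16384, 6)
```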
Results. We adopt two classical criteria to evaluate the performance of different models: the pixel-wise accuracy (PA) and the intersection over union (IoU); a minimal computation sketch is given below. As shown in Figure 3, the performance of MRF is better than that of RF. Compared with existing consistent RF variants, the improvement of MRF is more significant. We also visualize the segmentation results of MRF, as shown in Figure 2. Although the performance of MRF may not be as good as that of some state-of-the-art deep learning based methods, it still achieves plausible results.
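For reference, the two criteria can be computed from a binary prediction mask and a ground-truth mask as in the following minimal sketch (our own helper functions, not code from the paper):

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return np.mean(pred == gt)

def iou(pred, gt, positive=1):
    """Intersection over union for the positive ('building') class."""
    inter = np.logical_and(pred == positive, gt == positive).sum()
    union = np.logical_or(pred == positive, gt == positive).sum()
    return inter / union if union > 0 else 1.0

pred = np.array([[1, 0], [1, 1]])
gt   = np.array([[1, 0], [0, 1]])
print(pixel_accuracy(pred, gt), iou(pred, gt))   # 0.75 0.666...
```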
In this part, we evaluate the performance of the consistent MRF under different hyper-parameters B_1 and B_2. Specifically, we consider a range of values starting from 0 for both B_1 and B_2, and the other hyper-parameters are the same as those stated in Section 5.1. Besides, the performance of each tree in MRF with respect to the privacy budget is shown in the Appendix.

Figure 4 displays the results for six datasets representing small, medium, and large datasets. It shows that the performance of the MRF is significantly improved as B_1 increases from zero, and it then becomes relatively stable once B_1 is sufficiently large. Similarly, the performance also improves as B_2 increases from zero, but the effect is less pronounced. When B_1 or B_2 is too small, the resulting multinomial distributions allow too much randomness, leading to the poor performance of the MRF. Besides, as shown in the figure, although the optimal values of B_1 and B_2 may depend on the specific characteristics of a dataset, such as the outcome scale and the dimension of the impurity-decrease vector, at our default setting (B_1 = B_2 = 10) the MRF achieves competitive performance on all datasets.

In this paper, we propose a new random forest framework, dubbed multinomial random forest (MRF), based on which we analyze its consistency and privacy-preservation. In the MRF, we propose two impurity-based multinomial distributions for the selection of the split feature and split value. Accordingly, the best split point has the highest probability of being chosen, while other candidate split points that are nearly as good as the best one also have a good chance of being selected. This split process is more reasonable, compared with the greedy split criterion used in existing methods. Besides, we also introduce the exponential mechanism of differential privacy for selecting the label of a leaf, so that the privacy-preservation of MRF can be discussed. Experiments and comparisons demonstrate that the MRF remarkably surpasses existing consistent random forest variants, and its performance is on par with Breiman's random forest. It is by far the first random forest variant that is consistent and has comparable performance to the standard random forest.
Broader Impact
Data privacy is critical for data security, and consistency is also an important theoretical property in this big data era. As such, our work has positive impacts in general.

Specifically, from the aspect of positive broader impacts, (1) our work theoretically analyzes the data privacy of the proposed RF framework, which allows a trade-off between performance and data privacy; (2) MRF is the first consistent RF variant whose performance is on par with that of the standard RF, therefore it can be used as an alternative to RF and inspire further research in this area; (3) the proposed impurity-based random splitting process is empirically verified to be more effective than the standard greedy approach, whereas its principle is not theoretically discussed, which may further inspire the theoretical analysis of this method.

For the negative broader impact, the proposed method further demonstrates the potential of data, which may heighten concerns about data privacy.
References

[1] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] Tim F. Cootes, Mircea C. Ionita, Claudia Lindner, and Patrick Sauer. Robust and accurate shape model fitting using random forest regression voting. In ECCV. Springer, 2012.
[3] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In ICCV, 2015.
[4] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In SIGKDD, 2009.
[5] Caiming Xiong, David Johnson, Ran Xu, and Jason J. Corso. Random forests for metric learning with implicit pairwise position dependence. In SIGKDD, 2012.
[6] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13(Apr):1063–1095, 2012.
[7] Leo Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistical Department, University of California at Berkeley, 2004.
[8] Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033, 2008.
[9] Robin Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543–562, 2012.
[10] Misha Denil, David Matheson, and Nando De Freitas. Narrowing the gap: Random forests in theory and in practice. In ICML, 2014.
[11] Yisen Wang, Shu-Tao Xia, Qingtao Tang, Jia Wu, and Xingquan Zhu. A novel consistent random forest framework: Bernoulli random forests. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3510–3523, 2017.
[12] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
[13] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
[14] Leo Breiman. Classification and Regression Trees. Routledge, 2017.
[15] Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.
[16] Zhi-Hua Zhou and Ji Feng. Deep forest: towards an alternative to deep neural networks. In AAAI, 2017.
[17] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.
[18] Hemant Ishwaran and Udaya B. Kogalur. Consistency of random survival forests. Statistics & Probability Letters, 80(13-14):1056–1064, 2010.
[19] Misha Denil, David Matheson, and Nando Freitas. Consistency of online random forests. In ICML, 2013.
[20] Susan Athey, Julie Tibshirani, Stefan Wager, et al. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
[21] Siavash Haghiri, Damien Garreau, and Ulrike von Luxburg. Comparison-based random forests. In ICML, 2018.
[22] Noman Mohammed, Rui Chen, Benjamin C. M. Fung, and Philip S. Yu. Differentially private data release for data mining. In SIGKDD, 2011.
[23] Abhijit Patil and Sanjay Singh. Differential private random forest. In ICACCI, 2014.
[24] Cynthia Dwork. Differential privacy. In ICALP, 2006.
[25] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[26] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, 2007.
[27] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[28] Frank McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Communications of the ACM, 53(9):89–97, 2010.
[29] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
[30] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
[31] Robert M. Haralick, Its'hak Dinstein, and K. Shanmugam. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 3(6):610–621, 1973.
[32] Fei Tony Liu, Kai Ming Ting, Yang Yu, and Zhi-Hua Zhou. Spectrum of variable-random trees. Journal of Artificial Intelligence Research, 32:355–384, 2008.

Appendix
A Omitted Proofs
A.1 The Proof of Lemma 3

Lemma 3.
In the MRF, the probability that any given feature A is selected for splitting at each node has a lower bound P_1 > 0.

Proof. Recall that the normalized impurity-decrease vector Î ∈ [0, 1]^D. When Î = (1, 0, ..., 0), the probability that the first feature is selected for splitting is the largest, and when Î = (0, 1, ..., 1), this probability is the smallest. Therefore,
\[ P_1 \triangleq \frac{1}{1 + (D-1)e^{B_1}} \le \Pr(v \in A) \le \frac{e^{B_1}}{e^{B_1} + (D-1)}. \]

A.2 The Proof of Lemma 4

Lemma 4.
Suppose that all features are supported on [0, 1]. In the MRF, once a split feature A is selected, if this feature is divided into N (N ≥ 3) equal partitions A^(1), ..., A^(N) from small to large (i.e., A^(i) = [(i−1)/N, i/N]), then for any split point v there exists P_2 > 0 such that
\[ \Pr\Big(v \in \bigcup_{i=2}^{N-1} A^{(i)} \,\Big|\, A\Big) \ge P_2. \]

Proof. Suppose m is the number of possible split values of feature A. Similar to Lemma 3, the probability that a value is selected for splitting satisfies
\[ \frac{1}{1 + (m-1)e^{B_2}} \le \Pr(v) \le \frac{e^{B_2}}{e^{B_2} + (m-1)}. \quad (7) \]
In this case,
\[ \Pr\Big(v \in \bigcup_{i=2}^{N-1} A^{(i)} \,\Big|\, A\Big) = \frac{\int_{\bigcup_{i=2}^{N-1} A^{(i)}} f(v)\,dv}{\int_{A} f(v)\,dv} \ge \lim_{m \to +\infty} \frac{\int_{\bigcup_{i=2}^{N-1} A^{(i)}} \frac{1}{1+(m-1)e^{B_2}}\,dv}{\int_{A} \frac{e^{B_2}}{e^{B_2}+(m-1)}\,dv} = \lim_{m \to +\infty} \frac{N-2}{N} \cdot \frac{e^{B_2}+(m-1)}{1+(m-1)e^{B_2}} = \frac{N-2}{N}\, e^{-B_2} \triangleq P_2. \quad (8) \]

A.3 The Proof of Theorem 1

Theorem 1.
Suppose that X is supported on [0, 1]^D and has a non-zero density almost everywhere, and that the cumulative distribution function of the split points is right-continuous at 0 and left-continuous at 1. If B_3 → ∞, the MRF is consistent when k → ∞ and k/n → 0 as n → ∞.

Proof. When B_3 → ∞, the prediction in each leaf is based on a majority vote, so the prerequisite of Lemma 2 is met. Accordingly, we can prove the consistency of the MRF by showing that the MRF meets the two requirements of Lemma 2.

Firstly, since the MRF requires |N^E(X)| ≥ k, where k → ∞ as n → ∞, |N^E(X)| → ∞ as n → ∞ is trivial.

Let V_m(a) denote the size of the a-th feature of N_m(X), where X falls into the node N_m(X) at the m-th layer. To prove diam(N(X)) → 0 in probability, we only need to show that E(V_m(a)) → 0 for all A_a ∈ A. For a given feature A_a, let V*_m(a) denote the largest size of this feature among all children of node N_{m−1}(X). By Lemma 4, we obtain
\[ E(V^*_m(a)) \le (1 - P_2)\, V_{m-1}(a) + P_2 \frac{N-1}{N}\, V_{m-1}(a) = \Big(1 - \frac{P_2}{N}\Big) V_{m-1}(a). \quad (9) \]
By Lemma 3, we have
\[ E(V_m(a)) \le (1 - P_1)\, V_{m-1}(a) + P_1\, E(V^*_m(a)) = \Big(1 - \frac{P_1 P_2}{N}\Big) V_{m-1}(a). \quad (10) \]
Since V_0(a) = 1,
\[ E(V_m(a)) \le \Big(1 - \frac{P_1 P_2}{N}\Big)^{m}. \quad (11) \]

Unlike the deterministic rule in Breiman's RF, the split-point rule in the proposed MRF is random; therefore the finally selected split point can be regarded as a random variable W_i (i ∈ {1, ..., m}), whose cumulative distribution function is denoted by F_{W_i}. Let M = min(W_1, 1 − W_1) denote the size of the root's smallest child; we have
\[ \Pr(M \ge \sigma^{1/m}) = \Pr(\sigma^{1/m} \le W_1 \le 1 - \sigma^{1/m}) = F_{W_1}(1 - \sigma^{1/m}) - F_{W_1}(\sigma^{1/m}). \quad (12) \]
WLOG, we normalize the values of all attributes to the range [0, 1] for each node; then after m splits, the smallest child at the m-th layer has size at least σ with probability at least
\[ \prod_{i=1}^{m} \Big( F_{W_i}(1 - \sigma^{1/m}) - F_{W_i}(\sigma^{1/m}) \Big). \quad (13) \]
Since F_{W_i} is right-continuous at 0 and left-continuous at 1, for any α > 0 there exist σ > 0 and α_0 > 0 such that
\[ \prod_{i=1}^{m} \Big( F_{W_i}(1 - \sigma^{1/m}) - F_{W_i}(\sigma^{1/m}) \Big) > (1 - \alpha_0)^m > 1 - \alpha. \]
Since the distribution of X has a non-zero density, each node has a positive measure with respect to μ_X. Defining
\[ p = \min_{N:\ \text{a node at the } m\text{-th level}} \mu_X(N), \]
we know p > 0, since the minimum is over finitely many nodes and each node contains a set of positive measure.

Given a dataset of size n, the number of data points falling in the node A, where A denotes the m-th level node with measure p, follows Binomial(n, p). Note that this node A is the one containing the smallest expected number of samples. WLOG, considering the partition rate = 1, the expected number of estimation points in A is np/2. From Chebyshev's inequality, we have
\[ \Pr\big(|N^E(X)| < k\big) = \Pr\big(|N^E(X)| - np/2 < k - np/2\big) \le \Pr\big(\big||N^E(X)| - np/2\big| > |k - np/2|\big) \le \frac{np(1-p)/2}{|k - np/2|^2} = \frac{p(1-p)}{2n\,|k/n - p/2|^2}, \quad (14) \]
where the first inequality holds since k − np/2 is negative as n → ∞ and the second one is Chebyshev's inequality.

Since the right-hand side goes to zero as n → ∞, the node contains at least k estimation points in probability. By the stopping condition, the tree grows infinitely often in probability, i.e.,
\[ m \to \infty. \quad (15) \]
By (11) and (15), the theorem is proved.

A.4 The Proof of Lemma 5

Lemma 5.
The impurity-based multinomial distribution M(φ) for feature selection is essentially the exponential mechanism of differential privacy, and it satisfies B_1-differential privacy.

Proof. The softmax function is
\[ f(\mathbf{z})_j = \frac{\exp(z_j)}{\sum_{i=1}^{D} \exp(z_i)}, \]
which has the same form as the exponential mechanism (see Eq. (2) in Definition 2). In what follows, we prove that M(φ) satisfies B_1-differential privacy.

For any two neighboring datasets D^S and D'^S, and any selected feature A ∈ A, we have
\[ \frac{\exp\!\big(B_1 \hat{I}(\mathcal{D}^S, A)/2\big)}{\exp\!\big(B_1 \hat{I}(\mathcal{D}'^S, A)/2\big)} = \exp\!\Big(\frac{B_1}{2}\big(\hat{I}(\mathcal{D}^S, A) - \hat{I}(\mathcal{D}'^S, A)\big)\Big) \le \exp\!\Big(\frac{B_1}{2}\Big), \quad (16) \]
where the quality function Î(D^S, A) denotes the entry of the normalized vector Î corresponding to feature A, computed on the structure-point dataset D^S, and the corresponding sensitivity is 1 owing to the normalization (i.e., ΔÎ = max_{∀A, D^S, D'^S} |Î(D^S, A) − Î(D'^S, A)| = 1). Accordingly, applying (16) to the numerator ratio and, term by term, to the ratio of the normalizing sums, for any output A of M(φ) we have
\[ \frac{\Pr[\mathcal{M}(\phi, \mathcal{D}^S) = A]}{\Pr[\mathcal{M}(\phi, \mathcal{D}'^S) = A]} = \frac{\exp\!\big(B_1 \hat{I}(\mathcal{D}^S, A)/2\big)}{\exp\!\big(B_1 \hat{I}(\mathcal{D}'^S, A)/2\big)} \cdot \frac{\sum_{A' \in \mathcal{A}} \exp\!\big(B_1 \hat{I}(\mathcal{D}'^S, A')/2\big)}{\sum_{A' \in \mathcal{A}} \exp\!\big(B_1 \hat{I}(\mathcal{D}^S, A')/2\big)} \le \exp\!\Big(\frac{B_1}{2}\Big) \cdot \exp\!\Big(\frac{B_1}{2}\Big) = \exp(B_1). \]
Therefore, for each layer of a tree, the privacy budget consumed by the feature-split mechanism is B_1. That is, M(φ) satisfies B_1-differential privacy.

A.5 The Proof of Lemma 6

Lemma 6.
The impurity-based multinomial distribution M(ϕ) for split value selection is essentially the exponential mechanism of differential privacy, and it satisfies B_2-differential privacy.

Proof. Similar to the proof of Lemma 5, the split value selection M(ϕ) is essentially the exponential mechanism of differential privacy. For any two neighboring datasets D^S and D'^S, and any selected split value a_j[i] ∈ a_j = {a_j[1], ..., a_j[m]} of feature A_j, we have
\[ \frac{\exp\!\big(B_2 \hat{I}^{(j)}(\mathcal{D}^S, a_j[i])/2\big)}{\exp\!\big(B_2 \hat{I}^{(j)}(\mathcal{D}'^S, a_j[i])/2\big)} = \exp\!\Big(\frac{B_2}{2}\big(\hat{I}^{(j)}(\mathcal{D}^S, a_j[i]) - \hat{I}^{(j)}(\mathcal{D}'^S, a_j[i])\big)\Big) \le \exp\!\Big(\frac{B_2}{2}\Big), \]
where the quality function Î^(j)(D^S, a_j[i]) denotes the i-th entry of the normalized vector Î^(j) computed on the structure-point dataset D^S, and the corresponding sensitivity is 1 owing to the normalization. Accordingly, for any output a_j[i] of M(ϕ), we have
\[ \frac{\Pr[\mathcal{M}(\varphi, \mathcal{D}^S) = a_j[i]]}{\Pr[\mathcal{M}(\varphi, \mathcal{D}'^S) = a_j[i]]} = \frac{\exp\!\big(B_2 \hat{I}^{(j)}(\mathcal{D}^S, a_j[i])/2\big)}{\exp\!\big(B_2 \hat{I}^{(j)}(\mathcal{D}'^S, a_j[i])/2\big)} \cdot \frac{\sum_{a_j[k] \in a_j} \exp\!\big(B_2 \hat{I}^{(j)}(\mathcal{D}'^S, a_j[k])/2\big)}{\sum_{a_j[k] \in a_j} \exp\!\big(B_2 \hat{I}^{(j)}(\mathcal{D}^S, a_j[k])/2\big)} \le \exp\!\Big(\frac{B_2}{2}\Big) \cdot \exp\!\Big(\frac{B_2}{2}\Big) = \exp(B_2). \]
Thus, the selection mechanism of the split value for a specific feature satisfies B_2-differential privacy.

A.6 The Proof of Lemma 7

Lemma 7.
The label selection of each leaf in a tree satisfies B_3-differential privacy.

Proof. For any two neighboring datasets D^E and D'^E, and any predicted label c ∈ K = {1, 2, ..., K} of x in a specific leaf, we have
\[ \frac{\exp\!\big(B_3 \eta(\mathcal{D}^E, c)/2\big)}{\exp\!\big(B_3 \eta(\mathcal{D}'^E, c)/2\big)} = \exp\!\Big(\frac{B_3}{2}\big(\eta(\mathcal{D}^E, c) - \eta(\mathcal{D}'^E, c)\big)\Big) \le \exp\!\Big(\frac{B_3}{2}\Big), \quad (17) \]
where the quality function η(D^E, c) is equivalent to η^(c)(x) and represents the score of predicting label c for the sample x based on the estimation-point dataset D^E. It is worth noting that η(D^E, c) is the empirical probability that sample x has label c, and thus the corresponding sensitivity is 1. Then, for any output c ∈ {1, 2, ..., K} of h(x), we have
\[ \frac{\Pr[h(\mathbf{x}, \mathcal{D}^E) = c]}{\Pr[h(\mathbf{x}, \mathcal{D}'^E) = c]} = \frac{\exp\!\big(B_3 \eta(\mathcal{D}^E, c)/2\big)}{\exp\!\big(B_3 \eta(\mathcal{D}'^E, c)/2\big)} \cdot \frac{\sum_{c' \in \mathcal{K}} \exp\!\big(B_3 \eta(\mathcal{D}'^E, c')/2\big)}{\sum_{c' \in \mathcal{K}} \exp\!\big(B_3 \eta(\mathcal{D}^E, c')/2\big)} \le \exp\!\Big(\frac{B_3}{2}\Big) \cdot \exp\!\Big(\frac{B_3}{2}\Big) = \exp(B_3). \]
Since each leaf divides the dataset D^E into disjoint subsets, according to Property 2, the prediction mechanism of each tree satisfies B_3-differential privacy.

Algorithm 1 Decision Tree Training in MRF: MTree()
Input: Structure points D^S, estimation points D^E, and hyper-parameters k, B_1, B_2.
Output: A decision tree T in MRF.
if |D^E| > k then
  Calculate the impurity decrease of all possible split points v_ij.
  Select the largest impurity decrease of each feature to create a vector I, calculate the normalized vector Î, and compute the probabilities φ = softmax(B_1 Î).
  Select a split feature randomly according to the multinomial distribution M(φ).
  Calculate the normalized vector Î^(j) for the selected split feature f_j, and compute the probabilities ϕ = softmax(B_2 Î^(j)).
  Select a split value randomly according to the multinomial distribution M(ϕ).
  Split D^S and D^E correspondingly into two disjoint subsets D^{S_l}, D^{S_r} and D^{E_l}, D^{E_r}, respectively.
  T.leftchild ← MTree(D^{S_l}, D^{E_l}, k, B_1, B_2)
  T.rightchild ← MTree(D^{S_r}, D^{E_r}, k, B_1, B_2)
end if
Return: A decision tree T in MRF.

Figure 5: An illustration of the data partition.

A.7 The Proof of Theorem 2

Theorem 2.
The proposed MRF satisfies ε-differential privacy when the hyper-parameters B_1, B_2, and B_3 satisfy B_1 + B_2 = ε/(d·t) and B_3 = ε/t, where t is the number of trees in the MRF and d is the depth of a tree such that d ≤ O(|D^E|/k).

Proof. Based on Property 1 together with Lemma 5 and Lemma 6, the privacy budget consumed by each layer of a tree is B_1 + B_2 = ε/(d·t). Since the depth of a tree is d, the total privacy budget consumed by the generation of the tree structure is d(B_1 + B_2) = ε/t. Since the datasets D^S and D^E are disjoint, according to Property 2, the total privacy budget of a tree is max{d(B_1 + B_2), B_3} = ε/t. As a result, the total privacy budget consumed by the MRF containing t trees is (ε/t)·t = ε, which implies that the MRF satisfies ε-differential privacy.

B More Details about the Training Process of MRF
We provide more details about the training process of MRF. The illustration of the partition process and the pseudo code of the training process are shown in Figure 5 and Algorithm 1, respectively; a Python sketch of this procedure is given after Figure 6.

Figure 6: A diagram showing the differences among the standard RF, Denil14, BRF, and MRF. Denil14 and BRF mostly choose the split point in a greedy way, while holding a small or even negligible probability of selecting another point randomly; their split process is very similar to that of the standard random forest. In contrast, in MRF, two impurity-based multinomial distributions are used for randomly selecting a split feature and a specific split point, respectively. All randomness in existing consistent RF variants aims only to fulfill the consistency, whereas the randomness in MRF is more reasonable.
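To complement Algorithm 1 and Figure 5, the following compact Python sketch renders the recursive MTree procedure. It is our own illustration: the class and helper names, the Gini criterion, the enumeration of candidate split values as midpoints between sorted unique values, and the simplified stopping rule are assumptions, and the B_3-randomized leaf prediction (sketched after Eq. (6)) is applied only at inference time from the stored leaf probabilities.

```python
import numpy as np

def gini(y, n_classes):
    # Gini impurity of an integer label array.
    p = np.bincount(y, minlength=n_classes) / max(len(y), 1)
    return 1.0 - np.sum(p ** 2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class Node:
    def __init__(self):
        self.feature = self.threshold = None
        self.left = self.right = None
        self.eta = None                    # empirical class probabilities of a leaf (Eq. (5))

def mtree(Xs, ys, Xe, ye, k, B1, B2, n_classes, rng):
    """Grow one MRF tree: structure from (Xs, ys), leaf probabilities from (Xe, ye).
    Simplified stop rule: become a leaf once the estimation points do not exceed k."""
    node = Node()
    if len(ye) <= k or len(ys) < 2:
        node.eta = np.bincount(ye, minlength=n_classes) / max(len(ye), 1)
        return node
    parent = gini(ys, n_classes)
    cands, gains = [], []                  # candidate split points v_ij and their impurity decreases
    for j in range(Xs.shape[1]):
        vals = np.unique(Xs[:, j])
        for t in (vals[:-1] + vals[1:]) / 2.0:
            mask = Xs[:, j] <= t
            dec = parent - mask.mean() * gini(ys[mask], n_classes) \
                         - (~mask).mean() * gini(ys[~mask], n_classes)
            cands.append((j, t)); gains.append(dec)
    if not cands:                          # e.g., all structure points identical
        node.eta = np.bincount(ye, minlength=n_classes) / max(len(ye), 1)
        return node
    gains = np.asarray(gains); feat = np.asarray([c[0] for c in cands])
    # M(phi): sample a feature from the softmax of its best normalized impurity decrease.
    feats_with_cands = np.unique(feat)
    best = np.array([gains[feat == j].max() for j in feats_with_cands])
    bh = (best - best.min()) / (best.max() - best.min() + 1e-12)
    j = feats_with_cands[rng.choice(len(feats_with_cands), p=softmax(B1 * bh))]
    # M(varphi): sample a split value of feature j in the same way.
    idx = np.where(feat == j)[0]
    gj = gains[idx]
    gh = (gj - gj.min()) / (gj.max() - gj.min() + 1e-12)
    t = cands[idx[rng.choice(len(idx), p=softmax(B2 * gh))]][1]
    sl, el = Xs[:, j] <= t, Xe[:, j] <= t
    node.feature, node.threshold = j, t
    node.left = mtree(Xs[sl], ys[sl], Xe[el], ye[el], k, B1, B2, n_classes, rng)
    node.right = mtree(Xs[~sl], ys[~sl], Xe[~el], ye[~el], k, B1, B2, n_classes, rng)
    return node

rng = np.random.default_rng(0)
X = rng.random((200, 4)); y = (X[:, 0] + X[:, 1] > 1).astype(int)
half = 100                                 # partition rate 1: half structure, half estimation points
tree = mtree(X[:half], y[:half], X[half:], y[half:], k=5, B1=10.0, B2=10.0, n_classes=2, rng=rng)
print(tree.feature, tree.threshold)
```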
C The Discussion of Parameter Settings in MRF
Firstly, we discuss the parameter settings from the aspect of privacy-preservation, by setting B_1, B_2, and B_3 to ensure that MRF satisfies ε-differential privacy. Suppose the number of trees and the depth of each tree are t and d, respectively. To fulfill ε-differential privacy, we can evenly allocate the total privacy budget ε to each tree, i.e., the privacy budget of each tree is ε/t. For each tree, the upper bound of the depth d is approximately O(|D^E|/k), i.e., d ≤ O(|D^E|/k). Accordingly, we can directly set d = |D^E|/k and evenly allocate the per-tree privacy budget ε/t to each layer, i.e., the privacy budget of each layer is ε/(d·t). As such, we can set B_1, B_2, and B_3 such that B_1 + B_2 = ε/(d·t) and B_3 = ε/t to ensure that MRF satisfies ε-differential privacy; a minimal calculation is sketched below.

From another perspective, the hyper-parameters play a role in regulating the relative probabilities under which the "best" candidate is selected. Specifically, the larger B_1 is, the smaller the noise added to φ, and thus the split feature with the largest normalized impurity decrease Î_i (i = 1, 2, ..., D) has a higher probability of being selected. Similarly, B_2 and B_3 control the noise added to the split value selection and the label selection, respectively. In addition, if B_3 → ∞, then regardless of the training set partitioning, when B_1 = B_2 = 0 all features and split values have the same probability of being selected, and the MRF becomes a completely random forest [32]; when B_1, B_2 → ∞, no noise is added, and the MRF becomes Breiman's original RF whose set of candidate features always contains all the features.
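A minimal sketch of this allocation (our own helper, not code from the paper; splitting the per-layer budget evenly between B_1 and B_2 is an assumption):

```python
def allocate_budget(epsilon, t, n_estimation, k):
    """Split a total privacy budget epsilon over t trees of depth d = |D^E| / k (Theorem 2)."""
    d = max(1, n_estimation // k)     # approximate upper bound on the tree depth
    B3 = epsilon / t                  # per-tree budget, spent on leaf label selection
    per_layer = epsilon / (d * t)     # B1 + B2 for each layer of the tree structure
    B1 = B2 = per_layer / 2.0         # one possible even split between feature and value selection
    return B1, B2, B3

print(allocate_budget(epsilon=1.0, t=100, n_estimation=5000, k=5))
```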
D Comparison among Different RFs

In this section, we first explain the differences between the MRF and Breiman's random forest (Breiman2001), and then describe the similarities and differences between the MRF and two other recently proposed consistent RF variants, Denil14 [10] and BRF [11]. The differences are summarized in Figure 6.

In the standard RF, a node split has two steps: 1) a restrictive step in which only a subset of randomly selected features is considered, and 2) a greedy step in which the best combination of split feature and split value is chosen so that the impurity decrease is maximized after the split. Both steps may have limitations. The restriction in the first step increases the diversity of the trees, but at the cost of missing an informative feature. The greedy approach in the second step may be too exclusive, especially when there are several choices that are not significantly different from each other. The MRF improves on both steps. First, we introduce an approach that allows a more informed selection of split features. We compute the largest possible impurity decrease for every feature to represent the potential of each feature, and then use these values as the basis to construct a multinomial distribution for feature selection. Second, once a split feature is selected, we introduce randomness in the selection of a split value for the feature. We consider all possible values and use their corresponding impurity decreases as the basis to construct a second multinomial distribution for split point selection. This way, the best split point in the standard RF still has the highest probability of being selected, although it is not always selected as in the greedy approach. These new procedures allow us to achieve the degree of randomness necessary for the proof of consistency while still maintaining the overall quality of the trees. In the MRF, the diversity of the trees is achieved through the randomness in both steps (plus data partitioning); in contrast, in the standard RF, the diversity is achieved through the restrictive measure in the first step (plus bootstrap sampling).

All three RF variants – Denil14, BRF, and MRF – have the same procedure of partitioning the training set into two subsets to be used separately for tree construction and for prediction. This procedure is necessary for the proof of consistency, in contrast to the standard RF, where a bootstrap dataset is used for both tree construction and prediction.

For feature selection, all three RF variants have a procedure to ensure that every feature has a chance to be split on, which is necessary for proving consistency. However, the MRF is completely different from the other two methods in this aspect. The other two methods introduce a simple random distribution (a Poisson distribution in Denil14 and a Bernoulli distribution in BRF), but they are restrictive as they use a random subset of features, as in the standard RF. In the MRF, the randomness is built into an impurity-based multinomial distribution defined over all the features.

When determining the split point, all three RF variants have some level of randomness to ensure that every split point has a chance to be used, which is again necessary for proving consistency. However, all the methods except the MRF involve a greedy step, in which the final selection is deterministic. That is, whenever there is a pool of candidate split points, those methods always select the "best" one that has the largest impurity decrease. Specifically, in Denil14, a pool of randomly selected candidate split points is created and then searched for the "best" one. In BRF, a Bernoulli distribution is used so that with a small probability the split is based on a randomly selected point, and with a large probability the choice is the same as in the standard RF, that is, the "best" among all possible split points. This greedy approach of choosing the "best" split point at every node may be too exclusive. This is especially true when there are several candidate split points that are not significantly different from each other. In this case, it may not be a good strategy to choose whichever one happens to have the largest impurity decrease while ignoring all others that are nearly as good. This issue is solved by the non-greedy approach in the MRF, in which a split point is chosen randomly according to another impurity-based multinomial distribution. This way, the "best" split point has the highest probability of being chosen, but other candidate split points that are nearly as good as the "best" one also have a good chance of being selected. This flexibility in split point selection in the MRF may partially explain its good performance shown in the experiments.

In addition, the prediction of the MRF trees is also different from that of Denil14, BRF, and RF. Specifically, in MRF the predicted label is randomly selected with probability proportional to exp(B_3 η^(c)(x)/2), whereas the predictions of the other three methods are deterministic. This modification ensures that the prediction process also satisfies the exponential mechanism, so that the privacy-preservation can be analyzed under the differential privacy framework.

E The Description of UCI Benchmark Datasets
We conduct machine learning experiments on multiple UCI datasets used in previous consistent RF works [10, 11, 21]. These datasets cover a wide range of sample sizes and feature dimensions, and therefore they are representative for evaluating the performance of different algorithms. The description of the used datasets is shown in Table 2.

Table 2: The description of UCI benchmark datasets.

DATASET       SAMPLES  FEATURES  CLASSES
ZOO           101      17        7
HAYES         132      5         3
ECHO          132      12        2
HEPATITIS     155      19        2
WDBC          569      39        2
TRANSFUSION   748      5         2
VEHICLE       946      18        4
MAMMO         961      6         2
MESSIDOR      –        –         –
WEBSITE       –        –         –
BANKNOTE      –        –         –
CMC           –        –         –
YEAST         –        –         –
CAR           –        –         –
IMAGE         –        –         –
CHESS         –        –         –
ADS           –        –         –
WILT          –        –         –
WINE-QUALITY  –        –         –
PHISHING      –        –         –
NURSERY       –        –         –
CONNECT-4     67557    42        3
Table 3: Test accuracy (%) of a tree in MRF in terms of different privacy budgets B_3, B_1, and B_2.

B_3   B_1    B_2    WDBC   CMC    CONNECT-4
1     0.05   0.05   94.71  52.58  72.20
5     0.25   0.25   94.95  52.79  74.12
10    0.5    0.5    94.93  52.94  74.84
20    1      1      95.21  53.25  76.84
F Performance of Differential Privacy
In this section, we simulate the performance of each tree in our MRF for different privacy budgets. We conduct the experiments on the WDBC, CMC, and CONNECT-4 datasets, which are representative of small, medium, and large datasets, respectively. Specifically, according to Theorem 2, each tree in the MRF satisfies B_3-differential privacy, and the privacy budget consumed by each layer of a tree is B_1 + B_2, which satisfies B_1 + B_2 = B_3/d. Therefore, Table 3 presents how the performance changes with respect to B_3. In the experiments, we set B_3 = 1, 5, 10, and 20, respectively. Since we focus on the trade-off between accuracy and privacy, we simply set B_1 = B_2. Besides, we observe that the depth of each tree in MRF constructed on the selected datasets is no more than 10, so we directly set d = 10.

From the table, we can see that as B_3 increases, the performance of each tree increases, which matches the expected behavior of differential privacy. Specifically, when the privacy budget is relatively low, the added noise is relatively high, which results in reduced performance. On the contrary, when the privacy budget is relatively high, the added noise is relatively low, and thus the corresponding performance increases. Besides, we can observe that when B_3 decreases, the performance is hardly reduced. Thus, in practice, we can set B_3 to be relatively low to ensure better privacy without reducing performance.

F.1 Performance on Individual Trees
We further investigated why the MRF achieves such good performance by studying the performance of MRF on individual trees. In our 10 times 10-fold cross-validation, 10,000 trees were generated for each method. We compared the distribution of prediction accuracy over those trees between the MRF and the standard RF. Figure 7 displays the distributions for six datasets.

Figure 7: Distributions of the prediction accuracy of individual trees for the MRF and the standard RF on (a) ECHO, (b) CMC, (c) ADS, (d) WDBC, (e) CAR, and (f) CONNECT-4.