Estimation and Inference with Trees and Forests in High Dimensions
Vasilis Syrgkanis
Microsoft Research
[email protected]
Manolis Zampetakis
MIT
[email protected]

∗ Accepted for presentation at the Conference on Learning Theory (COLT) 2020.
Abstract
We analyze the finite sample mean squared error (MSE) performance of regression trees and forests in the high dimensional regime with binary features, under a sparsity constraint. We prove that if only $r$ of the $d$ features are relevant for the mean outcome function, then shallow trees built greedily via the CART empirical MSE criterion achieve MSE rates that depend only logarithmically on the ambient dimension $d$. We prove upper bounds, whose exact dependence on the number of relevant variables $r$ depends on the correlation among the features and on the degree of relevance. For strongly relevant features, we also show that fully grown honest forests achieve fast MSE rates and their predictions are also asymptotically normal, enabling asymptotically valid inference that adapts to the sparsity of the regression function.

1 Introduction

Regression Trees [BFOS84] and their ensemble counterparts, Random Forests [Bre01], are among the most widely used estimation methods by machine learning practitioners. Despite their widespread use, their theoretical underpinnings are far from fully understood. Early works established sample complexity bounds for decision trees and other data-adaptive partitioning estimators [Nob96, LN96, MM00]. However, sample complexity bounds do not address the computational aspect of how to choose the best tree in the space. In practice, trees and forests are constructed in a greedy fashion, typically identifying the most empirically informative split at each step; an approach pioneered by [BFOS84, Bre01]. The consistency and estimation rates of such greedily built trees have proven notoriously more difficult to analyze. Recent breakthrough advances have shown that such greedily built trees are asymptotically consistent [Bia10, DMdF14, SBV15] in the low dimensional regime, where the number of features is a constant independent of the sample size. In another line of work, [MH16, WA18] provide asymptotic normality results for honest versions of Random Forests, where each tree is constructed using a random sub-sample of the original data and, further, each tree construction algorithm sub-divides the sub-sample into a random half-sample that is used for the construction of the tree structure and a separate half-sample used for the leaf estimates. However, these results are typically asymptotic, or their finite sample guarantees scale exponentially with the number
of features. Random Forests are used in practice to address estimation with high-dimensional features. Hence these works, though of immense theoretical importance to our understanding of adaptively built trees and forests, do not provide theoretical foundations for the finite sample superiority of these algorithms in practice.

In this work, we analyze the performance of regression trees and forests in the high-dimensional regime, where the number of features can grow exponentially with the number of samples. To focus on the high-dimensionality of the features (as opposed to the ability of forests to sub-divide continuous variable spaces), we constrain our analysis to the case when all features are binary. We show that trees and forests built greedily based on the original CART criterion provably adapt to sparsity: when only a subset $R$, of size $r$, of the features are relevant, then the mean squared error of appropriately shallow trees, or of fully grown honest forests, scales exponentially only with the number of relevant features and depends only logarithmically on the overall number of features. We analyze two variants of greedy tree algorithms: in the level-split variant, the same variable is chosen at all the nodes of each level of the tree and is greedily chosen so as to maximize the overall variance reduction. In the second variant, which is the most popular in practice, the choice of the next variable to split on is decided locally at each node of the tree.

We identify three regimes, each providing a different dependence on the number of relevant features. When the relevant variables are "weakly" relevant (in the sense that there is no strong separation between the relevant and irrelevant variables in terms of their ability to reduce variance), then shallow trees achieve "slow rates" on the mean squared error of the order of $2^r/\sqrt{n}$ when the variables are independent, and of the order of $n^{-1/(C \cdot r + 1)}$ when the variables are dependent (with $C$ the approximate-submodularity constant of Assumption 3.1). When the relevant variables are "strongly" relevant, in that there is a separation in their ability to reduce variance, as compared to the irrelevant ones, by a constant $\beta_{\min}$, then we show that greedily built shallow trees and fully grown honest forests can achieve fast parametric mean squared error rates of the order of $2^r/(\beta_{\min}^2\, n)$.

When variables are strongly relevant, we also show that the predictions of sub-sampled honest forests have an asymptotically normal distribution, centered around their true values, whose variance scales at most as $O(2^r \log(n)/(\beta_{\min}^2\, n))$. Thus sub-sampled honest forests are provably a data-adaptive method for non-parametric inference, one that adapts to the latent sparsity dimension of the data generating distribution, as opposed to classical non-parametric regression approaches, whose variance would deteriorate drastically with the overall number of features. Our results show that, at least for the case of binary features, forest based algorithms can offer an immense improvement in the statistical power of non-parametric hypothesis tests in high-dimensional regimes.

The main crux of our analysis is showing bounds on the decay of the bias of decision trees constructed via the mean-squared-error criterion. In particular, we show that either a relevant variable leads to a large decrease in the mean squared error, in which case we prove that with high probability it is chosen in the first few levels of the tree, or, if not, then its impact on the mean squared error due to the fact that the algorithm failed to choose it can be controlled.
For achieving the fast rates of $1/n$ for shallow trees, we also develop a new localized Rademacher analysis [BBM02, Wai19] for adaptive partitioning estimators [GKKW06], to provide fast rates for the "variance" part of the MSE. Our results on honest forests utilize recent work on the concentration and asymptotic normality of sub-sampled estimators [MH16, WA18, FLW18] and combine it with our proof of the bias decay, which, for the case of strongly relevant features, as we show, is exponential in the number of samples.

Several theoretical aspects of variants of CART trees and forests have been analyzed in recent years [LJ02, Mei06, AG14, Bre04, Sco16]. The majority of these works deal with the low dimensional regime and, with few exceptions, these results deal with trees built with random splitting criteria or make no use of the fact that splits are chosen to minimize the CART mean-squared-error criterion. Arguably closest to our work is that of [WW15], who consider a high dimensional regime with continuous variables, distributed according to a distribution with a continuous density that is uniformly upper and lower bounded. The main focus of that work is proving a uniform concentration bound on the mean squared error objective, locally at every node of the adaptively constructed tree, and for this reason it makes several assumptions not present in our work: e.g. minimum leaf size constraints and approximately balanced splits. Crucially, their results on random forest consistency require an analogue of our $\beta_{\min}$ condition, and do not offer results without strong relevance. Moreover, their results on the consistency of forests require a strong modification of the CART algorithm: split variables are selected based on initial median splits, subject to a lower bound on the decrease in variance, and then only the chosen variables are used in subsequent splits, not based on a CART criterion, but rather simply choosing random median splits; invoking an analysis of such random median trees in low dimensions by [DS16]. Moreover, they provide no results on asymptotic normality.

Apart from the literature related to Random Forests and the CART criterion, there has been a great amount of work on the sparse non-parametric regression problem that we consider in this paper. A lot of heuristic methods have been proposed, such as: $C_p$ and AIC for additive models [HTF09], MARS [Fri91], Bayesian methods [GM97], Gaussian Processes [SB01] and more. All these methods are very successful in many practical scenarios, but our theoretical understanding of their performance is limited. Our work is more closely related to the works of [LW+08, LC09, CD12, YT15] and citations therein, which propose and theoretically analyze greedy algorithms that exploit the sparsity of the input regression function, and hence provide a way to overcome the curse of dimensionality of high dimensional data in non-parametric regression. The main difference of this line of work from our paper is that we do not propose a new algorithm; instead, our goal is to analyze the performance of the heuristically proposed CART trees in the setting of sparse high-dimensional non-parametric regression with binary features.
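To fix ideas before the formal setup, honesty as used throughout the paper (one random half-sample chooses the tree structure, the other fills in the leaf averages) can be sketched as follows; this is a minimal illustration of the mechanism, with names of our own choosing, not code from the paper:

```python
import numpy as np

def honest_split(X, y, rng=None):
    """Split the sample into two random halves: one half is used to build
    the tree structure, the other to compute the leaf estimates (honesty)."""
    rng = np.random.default_rng(0) if rng is None else rng
    perm = rng.permutation(len(X))
    struct, est = perm[: len(X) // 2], perm[len(X) // 2:]
    return (X[struct], y[struct]), (X[est], y[est])
```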
2 Preliminaries

In this work we consider the non-parametric regression model with binary features. More precisely, we assume that we have access to a training set $D_n = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, which consists of $n$ i.i.d. samples of the form $(x^{(i)}, y^{(i)})$, sampled independently from a common distribution $\mathcal{D}$. Each sample is generated according to the following steps:

1. $x^{(i)}$ is sampled from a distribution $\mathcal{D}_x$ with support $\{0,1\}^d$,
2. $\varepsilon^{(i)}$ is sampled from a zero mean error distribution $\mathcal{E}$ with support $[-1/2, 1/2]$, i.e. $\mathbb{E}_{\varepsilon \sim \mathcal{E}}[\varepsilon] = 0$ and $\varepsilon^{(i)} \in [-1/2, 1/2]$,
3. $y^{(i)} = m(x^{(i)}) + \varepsilon^{(i)}$, where $m : \{0,1\}^d \to [-1/2, 1/2]$.

The goal of the regression task is to estimate the target function $m$. Observe that from the definition of the non-parametric regression model we have that $y^{(i)} \in [-1, 1]$. Our results apply to any case where both the error distribution and the values of the target function are bounded, i.e. $|\varepsilon^{(i)}| \leq H$ and $|m(x)| \leq H$; in this case the sample complexity bounds and the rates should be multiplied by $H^2$. For simplicity we present the results for the case $|y^{(i)}| \leq 1$.
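As a concrete instance of the data generating process above, the following minimal sketch (our own illustration; the specific 2-sparse target $m$ is an arbitrary example) draws a training set with $d$ binary features, of which only two are relevant:

```python
import numpy as np

def sample_training_set(n, d, rng=None):
    """Draw n i.i.d. samples: x uniform on {0,1}^d, y = m(x) + eps with
    eps ~ Uniform[-1/2, 1/2] (zero mean), so that |y| <= 1."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = rng.integers(0, 2, size=(n, d))       # step 1: binary features
    eps = rng.uniform(-0.5, 0.5, size=n)      # step 2: zero-mean bounded noise
    m = 0.5 * X[:, 0] * X[:, 1] - 0.25        # example 2-sparse target, in [-1/4, 1/4]
    y = m + eps                               # step 3: outcomes
    return X, y
```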
For any $x \in \{0,1\}^d$ we define the vector $x_S$ as the sub-vector of $x$ where we keep only the coordinates with indices in $S \subseteq [d]$. Additionally, we define $\mathcal{D}_{x,S}$ as the marginal distribution of $\mathcal{D}_x$ on the coordinates $S$. Also, let $x_{(K)}$ be an arbitrary $x_j$ such that $j \in K$. For any training set $D_n$ we define the set $D_{n,x} = \{x^{(1)}, \ldots, x^{(n)}\}$. For any set $S \subseteq [d]$ and an index $i \in [d]$ we sometimes use the notation $S \cup i$ to refer to $S \cup \{i\}$.

All of the results that we present in this paper are in the "high-dimensional" regime, where the number of features is very big but the number of relevant features is small. When this is true we say that the function $m$ is sparse, as we explain in the following definition.

Definition 2.1 (Sparsity). We say that the target function $m : \{0,1\}^d \to \mathbb{R}$ is $r$-sparse if and only if there exists a set $R \subseteq [d]$, with $|R| = r$, and a function $h : \{0,1\}^r \to \mathbb{R}$ such that for every $z \in \{0,1\}^d$ it holds that $m(z) = h(z_R)$. The set $R$ is called the set of relevant features.

Some of the results that we have in this paper significantly improve if we make the assumption that the feature vector distribution $\mathcal{D}_x$ is a product distribution. For this reason we define the "independence of features" assumption.

Assumption 2.2 (Independence of Features). We assume that there exist Bernoulli distributions $\mathcal{B}_1, \ldots, \mathcal{B}_d$ such that each $x^{(i)}_j$ is distributed independently according to $\mathcal{B}_j$.

We now give some definitions, related to the structure of a binary regression tree, that are important for the presentation of our results.
Definition 2.3 (Partitions, Cells and Subcells). A partition $P$ of $\{0,1\}^d$ is a family of sets $\{A_1, \ldots, A_s\}$ such that $A_j \subseteq \{0,1\}^d$, $A_j \cap A_k = \emptyset$ for all $j \neq k \in [s]$, and $\bigcup_{j=1}^{s} A_j = \{0,1\}^d$. Let $P$ be a partition of $\{0,1\}^d$. Every element $A$ of $P$ is called a cell of $P$, or just a cell if $P$ is clear from the context. Every cell $A$ has two subcells $A_{i0}, A_{i1}$ with respect to any direction $i$, which are defined as $A_{i0} = \{x \in A \mid x_i = 0\}$ and $A_{i1} = \{x \in A \mid x_i = 1\}$. For any $x \in \{0,1\}^d$ and any partition $P$, we define $P(x) \in P$ as the cell of $P$ that contains $x$.

Definition 2.4 (Split Operator and Refinement). For any partition $P$ of $\{0,1\}^d$, any cell $A \in P$ and any $i \in [d]$ we define the split operator $S(P, A, i)$ that outputs the partition in which the cell $A$ is split with respect to direction $i$. More formally, $S(P, A, i) = (P \setminus \{A\}) \cup \{A_{i0}, A_{i1}\}$. We can also extend the definition of the split operator to splits over sets of dimensions $I \subseteq [d]$, inductively as follows: if $i \in I$ then $S(P, A, I) = S(S(S(P, A, i), A_{i0}, I \setminus \{i\}), A_{i1}, I \setminus \{i\})$. A partition $P'$ is a refinement of a partition $P$ if every element of $P'$ is a subset of an element of $P$. Then we say that $P'$ is finer than $P$ and $P$ is coarser than $P'$, and we use the notation $P' \sqsubseteq P$.

3 Trees and Forests with Level Splits

In this section we present our analysis for the case when we run a level-split greedy algorithm to build a tree or a forest that approximates the target function $m$. We start with the necessary definitions to present the algorithm that we use. We refer to Appendix A.1 for a presentation and analysis of the population version of the algorithm, which is useful (though not necessary) to gain intuition for the finite sample proof.

Given a set of splits $S$, we define the expected mean squared error of $S$ as follows:
$$L(S) = \mathbb{E}_{x \sim \mathcal{D}_x}\left[\left(m(x) - \mathbb{E}_{w \sim \mathcal{D}_x}[m(w) \mid w_S = x_S]\right)^2\right] \quad (3.1)$$
$$= \mathbb{E}_{x \sim \mathcal{D}_x}\left[m(x)^2\right] - \mathbb{E}_{z_S \sim \mathcal{D}_{x,S}}\left[\left(\mathbb{E}_{w \sim \mathcal{D}_x}[m(w) \mid w_S = z_S]\right)^2\right] \triangleq \mathbb{E}_{x \sim \mathcal{D}_x}\left[m(x)^2\right] - V(S). \quad (3.2)$$
It is easy to see that $L$ is a monotone decreasing function of $S$ and hence $V$ is a monotone increasing function of $S$. $V$ can be viewed as a measure of heterogeneity of the within-leaf mean values of the target function $m$, over the leaves created by the splits $S$.

We present results based on either one of two main assumptions about $V$: approximate submodularity of $V$, or strong sparsity. These assumptions play a crucial role in the analysis of the performance of the random forest algorithm, both in the finite sample regime and in the population regime (presented in Appendix A.1). It is not difficult to see that without any assumption, no meaningful result about the consistency of greedily grown trees in the high dimensional setting is possible, as we illustrate in Appendix G.

Assumption 3.1 (Approximate Submodularity). Let $C \geq 1$; we say that the function $V$ is $C$-approximate submodular if and only if for any $T, S \subseteq [d]$, such that $S \subseteq T$, and any $i \in [d]$, it holds that $V(T \cup \{i\}) - V(T) \leq C \cdot (V(S \cup \{i\}) - V(S))$.

Assumption 3.2 (Strong Sparsity). A target function $m : \{0,1\}^d \to [-1/2, 1/2]$ is $(\beta, r)$-strongly sparse if $m$ is $r$-sparse with relevant features $R$ and the function $V$ satisfies: $V(T \cup \{j\}) - V(T) + \beta \leq V(T \cup \{i\}) - V(T)$, for all $i \in R$, $j \in [d] \setminus R$ and $T \subset [d] \setminus \{i\}$.
We next need to define the estimator that is produced by a level-split tree with a set of splits $S$. Given a set of splits $S$, a training set $D_n$ and an input $x$ we can define the estimate $m(x; S, D_n)$ as follows (for simplicity, we use $m_n(\cdot\,;\cdot)$, $N_n(\cdot\,;\cdot)$, and $T_n(\cdot\,,\cdot)$ instead of $m(\cdot\,;\cdot\,, D_n)$, $N(\cdot\,;\cdot\,, D_n)$ and $T(\cdot\,,\cdot\,; D_n)$):
$$m_n(x; S) = \frac{1}{N_n(x; T_n(S,x))} \sum_{j \in [n]} \mathbb{1}\{x^{(j)}_{T_n(S,x)} = x_{T_n(S,x)}\} \cdot y^{(j)}, \quad (3.3)$$
where $N_n(\cdot\,;\cdot)$ and $T_n(\cdot\,,\cdot)$ are defined as follows:
$$N_n(x; T) = \sum_{j \in [n]} \mathbb{1}\{x^{(j)}_T = x_T\}, \qquad T_n(S, x) = \mathop{\mathrm{argmax}}_{T \subseteq S,\ N_n(x;T) > 0} |T|.$$
In words, the function $T_n(S, x)$ returns the largest subset $T$ of the splits $S$ used to create the leaf of the tree that contains $x$, such that the leaf that corresponds to $T$ contains at least one training point. The function $N_n(x; T)$ is the number of training points in the leaf that contains $x$, when we split across the coordinates $T$.

For the presentation of the algorithm we also need the definition of the empirical mean squared error, given a set of splits $S$, as follows:
$$L_n(S) = \frac{1}{n} \sum_{j \in [n]} \left(y^{(j)} - m_n(x^{(j)}; S)\right)^2 \quad (3.4)$$
$$= \frac{1}{n} \sum_{j \in [n]} \left(y^{(j)}\right)^2 - \frac{1}{n} \sum_{j \in [n]} m_n(x^{(j)}; S)^2 \triangleq \frac{1}{n} \sum_{j \in [n]} \left(y^{(j)}\right)^2 - V_n(S). \quad (3.5)$$
It is easy to see that $V_n$ is a monotone increasing function and $L_n$ is a monotone decreasing function. We are now ready to present the level-split algorithm, both with and without honesty; for this we use the honesty flag $h$, where $h = 1$ corresponds to the honest version.
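To make the estimator (3.3) concrete, here is a minimal sketch (ours, not the paper's implementation) of the leaf-average estimate with the fallback rule $T_n(S, x)$, which keeps a largest subset of the splits whose leaf contains at least one training point; enumerating subsets is exponential in $|S|$, which is fine for shallow trees:

```python
import itertools
import numpy as np

def leaf_average(X, y, S, x):
    """Level-split tree estimate m_n(x; S) from (3.3): among all subsets T
    of the splits S whose leaf {j : X[j]_T = x_T} is non-empty, take a
    largest one (this is T_n(S, x)) and average y over that leaf."""
    n = len(X)
    for k in range(len(S), -1, -1):          # try the largest subsets first
        for T in itertools.combinations(S, k):
            cols = list(T)
            mask = (np.all(X[:, cols] == x[cols], axis=1)
                    if cols else np.ones(n, dtype=bool))
            if mask.any():                   # leaf non-empty: found T_n(S, x)
                return y[mask].mean()
```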
Algorithm 1: Level Split Algorithm
Input: maximum number of splits $\log(t)$, a training data set $D_n$, honesty flag $h$.
Output: tree approximation of $m$.
$V \leftarrow D_{n,x}$
if $h = 1$ then split $D_n$ randomly in half into $D_{n/2}, D'_{n/2}$, set $n \leftarrow n/2$, and set $V \leftarrow D'_{n/2,x}$.
Set $P_0 = \{\{0,1\}^d\}$, the partition associated with the root of the tree. For all $1 \leq \ell \leq n$, set $P_\ell = \emptyset$.
level $\leftarrow -1$; $S \leftarrow \emptyset$.
while level $< \log(t)$ do
  level $\leftarrow$ level $+ 1$; $P_{\text{level}+1} = \emptyset$.
  Select $i \in [d]$ that maximizes $V_n(S \cup \{i\})$ [see (3.5)].
  for all $A \in P_{\text{level}}$ do
    Cut the cell $A$ into cells $A_{ik} = \{x \mid x \in A \wedge x_i = k\}$, $k = 0, 1$.
    if $|V \cap A_{i0}| \geq 1$ and $|V \cap A_{i1}| \geq 1$ then $P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup \{A_{i0}, A_{i1}\}$
    else $P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup \{A\}$
  end
  $S \leftarrow S \cup \{i\}$
end
return $(P_n, m_n) = (P_{\text{level}+1},\ x \mapsto m_n(x; S))$ [see (3.3)].

Now we are ready to state our main result for the consistency of shallow trees with level splits, as described in Algorithm 1. The proof of this theorem can be found in Appendix D.
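As an illustration of the greedy criterion in Algorithm 1, the following minimal sketch (ours; it omits honesty and the empty-leaf bookkeeping, and reuses `leaf_average` and `numpy` from the sketch above) selects one split coordinate per level by maximizing the empirical quantity $V_n(S \cup \{i\})$ from (3.5):

```python
def V_n(X, y, S):
    """Empirical V_n(S) from (3.5): average squared leaf estimate."""
    return np.mean([leaf_average(X, y, S, x) ** 2 for x in X])

def level_split(X, y, num_levels):
    """Greedily pick one split coordinate per level, maximizing V_n;
    predictions at a point x are then leaf_average(X, y, S, x)."""
    S = []
    for _ in range(num_levels):
        candidates = [i for i in range(X.shape[1]) if i not in S]
        S.append(max(candidates, key=lambda i: V_n(X, y, S + [i])))
    return S
```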
Theorem 3.3. Let $D_n$ be i.i.d. samples from the non-parametric regression model $y = m(x) + \varepsilon$, where $m(x) \in [-1/2, 1/2]$, $\varepsilon \sim \mathcal{E}$, $\mathbb{E}_{\varepsilon \sim \mathcal{E}}[\varepsilon] = 0$ and $\varepsilon \in [-1/2, 1/2]$. Let also $S_n$ be the set of splits chosen by Algorithm 1, with input $h = 0$. Then the following statements hold.

1. Under the submodularity Assumption 3.1, assuming that $m$ is $r$-sparse, if we set as input the number of splits $\log(t) = \frac{C \cdot r}{C \cdot r + 1}\left(\log(n) - \log(\log(d/\delta))\right)$, then it holds that
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_n(x; S_n))^2\right] > \tilde{\Omega}\left(2^{\frac{C \cdot r}{C \cdot r + 1}} \cdot \left(\frac{r \log(d/\delta)}{n}\right)^{\frac{1}{C \cdot r + 1}}\right)\right) \leq \delta.$$

2. Under the submodularity Assumption 3.1, the independence of features Assumption 2.2 and assuming that $m$ is $r$-sparse, if $\log(t) = r$ then it holds that
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_n(x; S_n))^2\right] > \tilde{\Omega}\left(2^{C \cdot r}\sqrt{\frac{r \cdot \log(d/\delta)}{n}}\right)\right) \leq \delta.$$
3. If $m$ is $(\beta, r)$-strongly sparse as per Assumption 3.2, $n \geq \tilde{\Omega}\left(\frac{2^r \log(d/\delta)}{\beta^2}\right)$, and we set $\log(t) = r$, then we have
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_n(x; S_n))^2\right] > \tilde{\Omega}\left(\frac{2^r \log(d/\delta) \log(n)}{n}\right)\right) \leq \delta.$$

As we can see, the rates naturally improve as we make our assumptions stronger. The fastest rate is achievable when the $(\beta, r)$-strong sparsity holds (even without the submodularity or the independence condition), the second fastest rate when the features are independent and the submodularity holds, and the slowest rate when only the submodularity holds and there is arbitrary correlation between the features.

We now consider the case of fully grown honest trees. For this case it is necessary to consider forest estimators instead of trees, because a fully grown tree has very high variance. For this reason we use the subsampling technique and honesty. For any subset $D_s$ of size $s$ of the set of samples $D_n$, we build one tree estimator $m(\cdot\,; D_s)$ according to Algorithm 1 with inputs: $\log(t)$ large enough so that every leaf has two or three samples (fully grown tree), training set $D_s$ and $h = 1$.
Then our final estimator $m_{n,s}$ can be computed as follows:
$$m_{n,s}(x) = \binom{n}{s}^{-1} \sum_{D_s \subseteq D_n,\ |D_s| = s} \mathbb{E}_\omega[m(x; D_s)], \quad (3.6)$$
where $\omega$ is any internal randomness of the tree building algorithm (e.g. the sample splitting). We note that even though we phrase results for the latter estimator, where all sub-samples are averaged over and the expectation over the randomness $\omega$ is computed, our results carry over to the Monte Carlo approximation of this estimator, where only $B$ trees are created, each on a randomly drawn sub-sample and for a random draw of the randomness (see e.g. [WA18, OSW19]), assuming $B$ is large enough (for our guarantees to hold it suffices to take $B = \Theta(d\,n)$).

For the estimator $m_{n,s}$, and under the strong sparsity Assumption 3.2, we have the following consistency and asymptotic normality theorems. The proofs of the theorems are presented in Appendices D.3 and F.
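The Monte Carlo approximation mentioned above can be sketched as follows (our own illustration; `fit_tree` is a hypothetical stand-in for running Algorithm 1 on a sub-sample with $h = 1$ and $\log(t)$ large enough for fully grown leaves, not an API defined in the paper):

```python
import numpy as np

def honest_forest_predict(X, y, x, s, B, fit_tree, rng=None):
    """Monte Carlo version of the sub-sampled honest forest (3.6): average,
    over B random size-s sub-samples, the prediction of one honest tree
    grown on each sub-sample.

    fit_tree(X_sub, y_sub, rng) -> predict, a callable x -> tree estimate."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, preds = len(X), []
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)  # sub-sample without replacement
        predict = fit_tree(X[idx], y[idx], rng)     # rng carries the randomness omega
        preds.append(predict(x))
    return float(np.mean(preds))
```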
Theorem 3.4. Let $D_n$ be i.i.d. samples from the non-parametric regression model $y = m(x) + \varepsilon$, where $m(x) \in [-1/2, 1/2]$, $\varepsilon \sim \mathcal{E}$, $\mathbb{E}_{\varepsilon \sim \mathcal{E}}[\varepsilon] = 0$ and $\varepsilon \in [-1/2, 1/2]$. Let $m_{n,s}$ be the forest estimator that is built with sub-sampling of size $s$ from the training set, where every tree $m(x; D_s)$ is built using Algorithm 1, with inputs: $\log(t)$ large enough so that every leaf has two or three samples, and $h = 1$. Suppose Assumption 3.2 holds, $R$ is the set of relevant features, and for every $w \in \{0,1\}^r$ the marginal probability satisfies $\mathbb{P}_{z \sim \mathcal{D}_x}(z_R = w) \geq \zeta/2^r$. If $s = \tilde{\Theta}\left(\frac{2^r \log(d/\delta)}{\beta^2} + \frac{2^r \log(1/\delta)}{\zeta}\right)$, then it holds that
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m_{n,s}(x) - m(x))^2\right] \geq \tilde{\Omega}\left(\frac{2^r \log(1/\delta)}{n}\left(\frac{\log(d)}{\beta^2} + \frac{1}{\zeta}\right)\right)\right) \leq \delta.$$

Our next goal is to prove the asymptotic normality of the estimate $m_{n,s}$. To do so we need our estimation algorithm to treat samples a priori symmetrically (i.e. the estimate is invariant to permutations of the sample indices). Since, for simplicity, we have presented $m_{n,s}$ based on a deterministic algorithm, this might be violated. For this reason, for the normality result, before computing $m_{n,s}$ we apply a random permutation $\tau \in S_n$ to the training set $D_n$. The permutation $\tau$ is part of the internal randomness $\omega$ of the algorithm. Given the permutation $\tau$, we denote the estimate that we compute by $m_{n,s,\tau}$. Ideally we would like to compute the expected value of $m_{n,s,\tau}$ over a uniform choice of $\tau$, which we denote by $m_{n,s}$. However, this is computationally very expensive, since we would need to repeat the estimation for all $n!$ permutations. Instead we compute a Monte Carlo approximation of $m_{n,s}$ by sampling $B$ permutations from $S_n$ and taking the empirical average of the corresponding estimates. We denote this estimator by $m_{n,s,B}$.
Theorem 3.5. Under the same conditions as Theorem 3.4, and with the further assumptions that $\sigma^2(x) = \mathrm{Var}(y^{(i)} \mid x^{(i)} = x) \geq \sigma^2 > 0$ and that for the a priori fixed $x$ it holds that $\mathbb{P}_{z \sim \mathcal{D}_x}(z_R = x_R) \geq \zeta/2^r$, if we set
$$\tilde{\Theta}\left(\frac{2^r(\log(d) + \log(n))}{\beta^2} + \frac{2^r \log(n)}{\zeta}\right) \leq s \leq o(\sqrt{n}),$$
then for some $\sigma_n(x)$ with $\sigma_n^2(x) = O\left(\frac{s}{n}\right)$ it holds that
$$\sigma_n^{-1}(x)\left(m_{n,s,B}(x) - m(x)\right) \to_d N(0, 1), \quad (3.7)$$
where $B \geq n \log(n)$.

The latter asymptotic normality theorem enables the construction of asymptotically valid normal-based intervals, using estimates of the variance of the prediction. These estimates can be constructed either via the bootstrap or via methods particular to random forests, such as the infinitesimal jackknife proposed in [WHE14].
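As a usage note, Theorem 3.5 justifies intervals of the standard normal form; in the sketch below (our own illustration) `sigma2_hat` stands for any such variance estimate, e.g. from the bootstrap or the infinitesimal jackknife of [WHE14]:

```python
from scipy.stats import norm

def normal_interval(m_hat, sigma2_hat, alpha=0.05):
    """Two-sided (1 - alpha) normal-based interval for m(x), based on the
    asymptotic normality in (3.7): m_hat +/- z_{1-alpha/2} * sigma_hat."""
    z = norm.ppf(1 - alpha / 2)
    half_width = z * sigma2_hat ** 0.5
    return m_hat - half_width, m_hat + half_width
```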
4 Breiman's Algorithm

In this section we present our analysis for the case when the tree construction algorithm, at every iteration, chooses a different direction to split on at every cell of the current partition. We start with the necessary definitions to present the algorithm that we use. We refer to Appendix A.2 for a presentation and analysis of the population version of the algorithm, which contains some relevant definitions and lemmas used in our main proof.

We define the total expected mean squared error that is achieved by a partition $P$ of $\{0,1\}^d$, in the population model, as follows:
$$L(P) \triangleq \mathbb{E}_{x \sim \mathcal{D}_x}\left[\left(m(x) - \mathbb{E}_{z \sim \mathcal{D}_x}[m(z) \mid z \in P(x)]\right)^2\right] \quad (4.1)$$
$$= \mathbb{E}_x\left[m(x)^2\right] - \mathbb{E}_{x \sim \mathcal{D}_x}\left[\left(\mathbb{E}_{z \sim \mathcal{D}_x}[m(z) \mid z \in P(x)]\right)^2\right] \triangleq \mathbb{E}_x\left[m(x)^2\right] - V(P). \quad (4.2)$$
For simplicity, we use the shorthand notation $V(P, A, i)$ and $L(P, A, i)$ to denote $V(S(P, A, i))$ and $L(S(P, A, i))$. Observe that the function $L$ is increasing with respect to $P$, in the sense that if $P' \sqsubseteq P$ then $L(P') \leq L(P)$, and hence $V$ is decreasing with respect to $P$.

In order to define the splitting criterion of our algorithm, we will need to define a local version of the mean squared error, locally at a cell $A$, as follows:
$$L^\ell(A, P) \triangleq \mathbb{E}_{x \sim \mathcal{D}_x}\left[\left(m(x) - \mathbb{E}_{z \sim \mathcal{D}_x}[m(z) \mid z \in P(x)]\right)^2 \,\middle|\, x \in A\right] \quad (4.3)$$
$$= \mathbb{E}_{x \sim \mathcal{D}_x}\left[m(x)^2 \mid x \in A\right] - \mathbb{E}_x\left[\left(\mathbb{E}_z[m(z) \mid z \in P(x)]\right)^2 \,\middle|\, x \in A\right] \triangleq \mathbb{E}_x\left[m(x)^2 \mid x \in A\right] - V^\ell(A, P). \quad (4.4)$$
As shorthand notation, for any $P$ that contains $A$ we will use $V^\ell(A, I) = V^\ell(A, S(P, A, I))$, for any set of directions $I \subseteq [d]$ (observe that this quantity is independent of the choice of $P$, as long as it contains $A$). Similarly, we will use the shorthand notation $L^\ell(A, I)$. Finally, we will use the shorthand notations $L^\ell(A) \triangleq L^\ell(A, \emptyset)$ and $V^\ell(A) \triangleq V^\ell(A, \emptyset)$.

We now need to define the corresponding properties of submodularity and strong sparsity in this more complicated setting. Inspired by the economics literature, we call the analogue of submodularity for this setting the diminishing returns property. Moreover, we call the analogue of strong sparsity strong partition sparsity.

Assumption 4.1 (Approximate Diminishing Returns). For $C \geq 1$, we say that the function $V$ has the $C$-approximate diminishing returns property if for any cells $A, A'$, any $i \in [d]$ and any $T \subseteq [d]$ such that $A' \subseteq A$ it holds that $V^\ell(A', T \cup \{i\}) - V^\ell(A', T) \leq C \cdot (V^\ell(A, i) - V^\ell(A))$.

Assumption 4.2 (Strong Partition Sparsity). A target function $m : \{0,1\}^d \to [-1/2, 1/2]$ is $(\beta, r)$-strongly partition sparse if $m$ is $r$-sparse with relevant features $R$ and the function $V$ satisfies: $V^\ell(A, T \cup j) - V^\ell(A, T) + \beta \leq V^\ell(A, T \cup i) - V^\ell(A, T)$, for all possible cells $A$ and for all $i \in R$, $j \in [d] \setminus R$.

For some of the results in this section we need to assume that the density, or the marginal density, with respect to $x$ is lower bounded by some constant. The reason that we need this assumption is that the greedy decisions made by Algorithm 2 are made separately for every leaf of the tree. Therefore we need to make sure that, at least in the first important splits, every leaf has enough samples to choose the correct greedy option. For this reason we define the following assumption on the lower bound of the marginal density.

Assumption 4.3 (Marginal Density Lower Bound). We say that the density $\mathcal{D}_x$ is $(\zeta, q)$-lower bounded if for every set $Q \subseteq [d]$ with size $|Q| = q$ and for every $w \in \{0,1\}^q$ it holds that $\mathbb{P}_{x \sim \mathcal{D}_x}(x_Q = w) \geq \zeta/2^q$.

We next need to define the estimator that is defined by a tree that produces a partition $P$ of the space $\{0,1\}^d$. Given a training set $D_n$ and a cell $A \in P$, we define:
$$g_n(A) = \frac{1}{N_n(A)} \sum_{j \in [n]} y^{(j)} \cdot \mathbb{1}\{x^{(j)} \in A\} = \sum_{j \in [n]} W_n(x^{(j)}; A) \cdot y^{(j)}, \quad (4.5)$$
where $N_n(\cdot)$ and $W_n(\cdot\,;\cdot)$ are defined as follows:
$$N_n(A) = \sum_{j \in [n]} \mathbb{1}\{x^{(j)} \in A\}, \qquad W_n(x; A) = \frac{\mathbb{1}\{x \in A\}}{N_n(A)}. \quad (4.6)$$
In words, $N_n(A)$ is the number of training points in the cell $A$, and $W_n(x; A)$ is the coefficient of the training points that lie in the cell $A$ when computing the local estimate at $A$. We also define the set $\mathcal{Z}_n(A)$ as the subset of the training set $\mathcal{Z}_n(A) = \{j \mid x^{(j)} \in A\}$. Based on this we also define the partition $U_n(P)$ of the training set $D_n$ as $U_n(P) = \{\mathcal{Z}_n(A) \mid A \in P\}$. Given an input $x$, we define the estimate $m(x; P, D_n)$ as follows (for simplicity, we use $m_n(\cdot\,;\cdot)$, $N_n(\cdot)$ and $W_n(\cdot\,;\cdot)$ instead of $m(\cdot\,;\cdot\,, D_n)$, $N(\cdot\,; D_n)$ and $W(\cdot\,;\cdot\,, D_n)$):
$$m_n(x; P) = g_n(P(x)). \quad (4.7)$$
For the presentation of the algorithm we also need the definition of the empirical mean squared error, conditional on a cell $A$ and a potential split direction $i$, as follows:
$$L^\ell_n(A, i) \triangleq \sum_{z \in \{0,1\}} \frac{N_n(A_{iz})}{N_n(A)} \cdot \frac{1}{N_n(A_{iz})} \sum_{j \in \mathcal{Z}_n(A_{iz})} \left(y^{(j)} - g_n(A_{iz})\right)^2 \quad (4.8)$$
$$= \frac{1}{N_n(A)} \sum_{j \in \mathcal{Z}_n(A)} \left(y^{(j)}\right)^2 - \sum_{z \in \{0,1\}} \frac{N_n(A_{iz})}{N_n(A)} \left(g_n(A_{iz})\right)^2 \triangleq \frac{1}{N_n(A)} \sum_{j \in \mathcal{Z}_n(A)} \left(y^{(j)}\right)^2 - V^\ell_n(A, i). \quad (4.9)$$
We are now ready to present Breiman's tree construction algorithm, both with and without honesty (we use the honesty flag $h$, where $h = 1$ corresponds to the honest version).

Algorithm 2:
Breiman’s Tree Construction Algorithm
Input: maximum number of nodes $t$, a training data set $D_n$, honesty flag $h$.
Output: tree approximation of $m$.
$V \leftarrow D_{n,x}$
if $h = 1$ then split $D_n$ randomly in half into $D_{n/2}, D'_{n/2}$, set $n \leftarrow n/2$, and set $V \leftarrow D'_{n/2,x}$.
Set $P_0 = \{\{0,1\}^d\}$, the partition associated with the root of the tree. For all $1 \leq \ell \leq t$, set $P_\ell = \emptyset$.
level $\leftarrow 0$; n_nodes $\leftarrow 1$; queue $\leftarrow P_0$.
while n_nodes $< t$ do
  if queue $= \emptyset$ then level $\leftarrow$ level $+ 1$; queue $\leftarrow P_{\text{level}}$
  Pick $A$, the first element of queue.
  if $|V \cap A| \leq 1$ then queue $\leftarrow$ queue $\setminus \{A\}$; $P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup \{A\}$
  else
    Select $i \in [d]$ that maximizes $V^\ell_n(A, i)$ [see (4.9)].
    Cut the cell $A$ into cells $A_{ik} = \{x \mid x \in A \wedge x_i = k\}$, $k = 0, 1$.
    queue $\leftarrow$ queue $\setminus \{A\}$; $P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup \{A_{i0}, A_{i1}\}$; n_nodes $\leftarrow$ n_nodes $+ 1$.
  end
end
$P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup$ queue
return $(P_n, m_n) = (P_{\text{level}+1},\ x \mapsto m_n(x; P_{\text{level}+1}))$ [see (4.7)].

We can now state our main result for the consistency of shallow trees with Breiman's splits, as described in Algorithm 2. The proof of this theorem can be found in Appendix E. As we can see in Theorem 4.4, the rates improve as we make our assumptions stronger, similarly to the results for the level-split algorithm. The main difference between the results in this section and the results for the level-split algorithm is that, for the analysis of Breiman's algorithm, we need to assume that the probability mass function of the distribution $\mathcal{D}_x$ is lower bounded by $\zeta/2^d$.
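For concreteness, the per-cell split selection of Algorithm 2 can be sketched as follows (our own minimal illustration of criterion (4.9); honesty and the queue bookkeeping are omitted, and the cell is assumed non-empty):

```python
import numpy as np

def best_local_split(X, y, cell_mask):
    """Pick the direction i maximizing the empirical V_n^l(A, i) of (4.9);
    the cell A is given as a boolean mask over the training points."""
    n_A = cell_mask.sum()
    best_i, best_V = None, -np.inf
    for i in range(X.shape[1]):
        V = 0.0
        for z in (0, 1):
            child = cell_mask & (X[:, i] == z)
            if child.any():
                # weight N_n(A_iz)/N_n(A) times squared child mean g_n(A_iz)^2
                V += child.sum() / n_A * y[child].mean() ** 2
        if V > best_V:
            best_i, best_V = i, V
    return best_i
```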
Theorem 4.4. Let $D_n$ be i.i.d. samples from the non-parametric regression model $y = m(x) + \varepsilon$, where $m(x) \in [-1/2, 1/2]$, $\varepsilon \sim \mathcal{E}$, $\mathbb{E}_{\varepsilon \sim \mathcal{E}}[\varepsilon] = 0$, $\varepsilon \in [-1/2, 1/2]$, and $m$ is an $r$-sparse function. Let also $P_n$ be the partition that Algorithm 2 returns with input $h = 0$. Then the following statements hold.

1. Let $q = \frac{C \cdot r}{C \cdot r + 1}\left(\log(n) - \log(\log(d/\delta))\right)$ and assume that the approximate diminishing returns Assumption 4.1 holds. If we set the number of nodes $t$ such that $\log(t) \geq q$, and if the number of samples satisfies $n \geq \tilde{\Omega}(\log(d/\delta))$, then it holds that
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_n(x; P_n))^2\right] > \tilde{\Omega}\left(2^{\frac{C \cdot r}{C \cdot r + 1}} \cdot \left(\frac{r \log(d/\delta)}{n}\right)^{\frac{1}{C \cdot r + 1}}\right)\right) \leq \delta.$$
2. Suppose that the distribution $\mathcal{D}_x$ is a product distribution (see Assumption 2.2) and that Assumption 4.1 holds. If $\log(t) \geq r$, then it holds that
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_n(x; P_n))^2\right] > \tilde{\Omega}\left(2^r \sqrt{\frac{C \cdot r \cdot \log(d/\delta)}{n}}\right)\right) \leq \delta.$$
3. Suppose that the distribution $\mathcal{D}_x$ is a product distribution (see Assumption 2.2) that is also $(\zeta, r)$-lower bounded (see Assumption 4.3), and that Assumption 4.1 holds. If $\log(t) \geq r$, then it holds that
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_n(x; P_n))^2\right] > \tilde{\Omega}\left(C \cdot \sqrt{\frac{2^r \cdot \log(d/\delta)}{\zeta \cdot n}}\right)\right) \leq \delta.$$
4. Suppose that $m$ is $(\beta, r)$-strongly partition sparse (see Assumption 4.2) and that $\mathcal{D}_x$ is $(\zeta, r)$-lower bounded (see Assumption 4.3). If $n \geq \tilde{\Omega}\left(\frac{2^r \log(d/\delta)}{\zeta \cdot \beta^2}\right)$ and $\log(t) \geq r$, then we have
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_n(x; P_n))^2\right] > \tilde{\Omega}\left(\frac{2^r \log(d/\delta) \log(n)}{n}\right)\right) \leq \delta.$$

We now consider the case of fully grown honest trees. As in the case of level splits, we are going to use the subsampling technique and honesty. That is, for any subset $D_s$ of size $s$ of the set of samples $D_n$, we build one tree estimator $m(\cdot\,; D_s)$ according to Algorithm 2 with inputs: $\log(t)$ large enough so that every leaf has two or three samples, training set $D_s$ and $h = 1$.
Then our final estimator $m_{n,s}$ can be computed as follows:
$$m_{n,s}(x) = \binom{n}{s}^{-1} \sum_{D_s \subseteq D_n,\ |D_s| = s} \mathbb{E}_\omega[m(x; D_s)], \quad (4.10)$$
where $\omega$ is the internal randomness of the tree building algorithm. For this estimator $m_{n,s}$, and under the strong partition sparsity Assumption 4.2, we have the following consistency and asymptotic normality theorems. The proofs of the following theorems are presented in Appendices E.3 and F.
Theorem 4.5. Let $D_n$ be i.i.d. samples from the non-parametric regression model $y = m(x) + \varepsilon$, where $m(x) \in [-1/2, 1/2]$, $\varepsilon \sim \mathcal{E}$, $\mathbb{E}_{\varepsilon \sim \mathcal{E}}[\varepsilon] = 0$ and $\varepsilon \in [-1/2, 1/2]$. Suppose that $\mathcal{D}_x$ is $(\zeta, r)$-lower bounded (see Assumption 4.3). Let $m_{n,s}$ be the forest estimator that is built with sub-sampling of size $s$ from the training set, where every tree $m(x; D_s)$ is built using Algorithm 2, with inputs: $\log(t)$ large enough so that every leaf has two or three samples, training set $D_s$ and $h = 1$. Then, using $s = \tilde{\Theta}\left(\frac{2^r \log(d/\delta)}{\zeta \cdot \beta^2}\right)$ and under Assumption 4.2:
$$\mathbb{P}_{D_n \sim \mathcal{D}^n}\left(\mathbb{E}_{x \sim \mathcal{D}_x}\left[(m(x) - m_{n,s}(x))^2\right] > \tilde{\Omega}\left(\frac{2^r \log(d/\delta)}{n \cdot \zeta \cdot \beta^2}\right)\right) \leq \delta.$$

Our next goal is to prove the asymptotic normality of the estimate $m_{n,s}$. As we have already discussed for the level-splits algorithm, to prove asymptotic normality we need our estimation algorithm to treat samples a priori symmetrically (i.e. the estimate is invariant to permutations of the sample indices). Since, for simplicity, we have presented $m_{n,s}$ based on a deterministic algorithm, this might be violated. For this reason, for the normality result, before computing $m_{n,s}$ we apply a random permutation $\tau \in S_n$ to the training set $D_n$. The permutation $\tau$ is part of the internal randomness $\omega$ of the algorithm. Given the permutation $\tau$, we denote the estimate that we compute by $m_{n,s,\tau}$. Ideally we would like to compute the expected value of $m_{n,s,\tau}$ over a uniform choice of $\tau$, which we denote by $m_{n,s}$. However, this is computationally very expensive, since we would need to repeat the estimation for all $n!$ permutations. Instead we compute a Monte Carlo approximation of $m_{n,s}$ by sampling $B$ permutations from $S_n$ and taking the empirical average of the corresponding estimates. We denote this estimator by $m_{n,s,B}$.
Theorem 4.6. Under the same conditions as Theorem 4.5, and with the further assumptions that $\sigma^2(x) = \mathrm{Var}(y^{(i)} \mid x^{(i)} = x) \geq \sigma^2 > 0$ and that for the a priori fixed $x$ it holds that $\mathbb{P}_{z \sim \mathcal{D}_x}(z_R = x_R) \geq \zeta/2^r$, if we set $\tilde{\Theta}\left(\frac{2^r \log(d\,n)}{\zeta \cdot \beta^2}\right) \leq s \leq o(\sqrt{n})$, then for some $\sigma_n(x)$ with $\sigma_n^2(x) = O\left(\frac{s}{n}\right)$ it holds that
$$\sigma_n^{-1}(x)\left(m_{n,s,B}(x) - m(x)\right) \to_d N(0, 1), \quad (4.11)$$
where $B \geq n \log(n)$.

Acknowledgements
This research was completed while MZ was an intern at Microsoft Research, New England. MZ is also supported by a Google Ph.D. Fellowship.
References

[AG14] Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv e-prints, page arXiv:1407.3939, Jul 2014.

[BBM02] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher complexities. In Jyrki Kivinen and Robert H. Sloan, editors, Computational Learning Theory, pages 44–58, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.

[BFOS84] Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and regression trees. Wadsworth Int. Group, 37(15):237–251, 1984.

[Bia10] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13, 2010.

[Bre01] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[Bre04] Leo Breiman. Consistency for a simple model of random forests. Statistical Department, University of California at Berkeley, Technical Report (670), 2004.

[CD12] Laëtitia Comminges and Arnak S. Dalalyan. Tight conditions for consistency of variable selection in the context of high dimensionality. The Annals of Statistics, 40(5):2667–2696, 2012.

[DMdF14] Misha Denil, David Matheson, and Nando de Freitas. Narrowing the gap: Random forests in theory and in practice. In International Conference on Machine Learning (ICML), 2014.

[DS16] Roxane Duroux and Erwan Scornet. Impact of subsampling and pruning on random forests. arXiv e-prints, page arXiv:1603.04261, Mar 2016.

[FLW18] Yingying Fan, Jinchi Lv, and Jingbo Wang. DNN: A two-scale distributional tale of heterogeneous treatment effect inference. arXiv e-prints, page arXiv:1808.08469, Aug 2018.

[Fri91] J. H. Friedman. Multivariate adaptive regression splines (with discussion). Ann. Statist., 19(1):79–141, 1991.

[FS19] Dylan J. Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.

[GKKW06] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

[GM97] Edward I. George and Robert E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, pages 339–373, 1997.

[Hoe94] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

[LC09] Han Liu and Xi Chen. Nonparametric greedy algorithms for the sparse learning problem. In Advances in Neural Information Processing Systems, pages 1141–1149, 2009.

[LJ02] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, pages 101–474, 2002.

[LN96] Gábor Lugosi and Andrew Nobel. Consistency of data-driven histogram methods for density estimation and classification. Ann. Statist., 24(2):687–706, 1996.

[LW+08] John Lafferty, Larry Wasserman, et al. Rodeo: sparse, greedy nonparametric regression. The Annals of Statistics, 36(1):28–63, 2008.

[Mei06] Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.

[MH16] Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research, 17(26):1–41, 2016.

[MM00] Yishay Mansour and David A. McAllester. Generalization bounds for decision trees. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, COLT '00, pages 69–74, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[Nob96] Andrew Nobel. Histogram regression estimation using data-dependent partitions. Ann. Statist., 24(3):1084–1105, 1996.

[OSW19] Miruna Oprescu, Vasilis Syrgkanis, and Zhiwei Steven Wu. Orthogonal random forest for causal inference. In International Conference on Machine Learning, pages 4932–4941, 2019.

[PAR10] Thomas Peel, Sandrine Anthoine, and Liva Ralaivola. Empirical Bernstein inequalities for U-statistics. In Advances in Neural Information Processing Systems, pages 1903–1911, 2010.

[SB01] Alex J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems, pages 619–625, 2001.

[SBV15] Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. Ann. Statist., 43(4):1716–1741, 2015.

[Sco16] Erwan Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis, 146:72–83, 2016. Special Issue on Statistical Models and Methods for High or Infinite Dimensional Spaces.

[VW96] A. W. Van Der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series, March 1996.

[WA18] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.

[Wai19] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

[WHE14] Stefan Wager, Trevor Hastie, and Bradley Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15:1625–1651, 2014.

[WW15] Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv e-prints, page arXiv:1503.06388, Mar 2015.

[YT15] Yun Yang and Surya T. Tokdar. Minimax-optimal nonparametric regression in high dimensions. The Annals of Statistics, 43(2):652–674, 2015.
A Population Models
In this section we present the population versions of the algorithms that we analyze in the main part of the paper, together with the analysis of their convergence. We suggest that the reader study these results first, before reading the complete finite sample analysis.
A.1 Population Algorithm using Level-Splits
We start with the presentation of the level-splits algorithm in the population model.
Algorithm 3: Level Split Algorithm – Population Model

Input: maximum number of splits $\log(t)$.
Output: tree approximation of $m$.
Set $P_0 = \{\{0,1\}^d\}$, the partition associated with the root of the tree.
level $\leftarrow -1$; $S \leftarrow \emptyset$.
while level $< \log(t)$ do
  level $\leftarrow$ level $+ 1$; $P_{\text{level}+1} = \emptyset$.
  Select $i \in [d]$ that maximizes $V(S \cup \{i\})$ [see (3.2)].
  for all $A \in P_{\text{level}}$ do
    Cut the cell $A$ into cells $A_{ik} = \{x \mid x \in A \wedge x_i = k\}$, $k = 0, 1$.
    $P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup \{A_{i0}, A_{i1}\}$
  end
  $S \leftarrow S \cup \{i\}$
end
return $(P, \hat{m}) = \left(P_{\text{level}},\ x \mapsto \mathbb{E}_{(z,y) \sim \mathcal{D}}[y \mid z \in P(x)]\right)$

Definition A.1 (Relevant Variables). Given a set $S$, we define the set of remaining relevant features $\mathcal{R}(S) = \{i \in [d] \mid V(S \cup \{i\}) > V(S)\}$.
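To build intuition for the greedy selection in Algorithm 3, the following sketch (ours; exhaustive enumeration, so only feasible for small $d$) computes the population quantity $V(S)$ of (3.2) for a product distribution with marginals $\mathbb{P}(x_i = 1) = p_i$:

```python
import itertools
import numpy as np

def population_V(m, p, S):
    """Population V(S) from (3.2): E_{z_S}[ (E[m(w) | w_S = z_S])^2 ],
    computed by enumerating {0,1}^d under a product distribution p."""
    d = len(p)
    points = [np.array(x) for x in itertools.product((0, 1), repeat=d)]
    probs = np.array([np.prod([p[i] if x[i] else 1 - p[i] for i in range(d)])
                      for x in points])
    vals = np.array([m(x) for x in points])
    V = 0.0
    for z in itertools.product((0, 1), repeat=len(S)):
        mask = np.array([all(x[i] == z[k] for k, i in enumerate(S))
                         for x in points])
        pz = probs[mask].sum()
        if pz > 0:  # weight each leaf by its probability mass
            V += pz * ((probs[mask] * vals[mask]).sum() / pz) ** 2
    return V
```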
Lemma A.2. For every set $S \subseteq [d]$, under Assumption 3.1, if $\mathcal{R}(S) = \emptyset$, then for any $x, x' \in \{0,1\}^d$ such that $x_S = x'_S$ it holds that $m(x) = m(x')$.

Proof. We prove this by contradiction. If there exist $x, x' \in \{0,1\}^d$ such that $x_S = x'_S$ and $m(x) \neq m(x')$, then obviously there exists an $\tilde{x} \in A$ such that $m(\tilde{x}) \neq \mathbb{E}_{x_{-S}}[m((x_S, x_{-S}))]$, where $A$ is the cell of the input space that contains all vectors $z$ with $z_S = x_S$. Therefore it holds that $L(S) > L([d]) = 0$, and hence $V([d]) > V(S)$. Now consider an arbitrary enumeration $\{i_1, \ldots, i_k\}$ of the set $S^c = [d] \setminus S$. Because the function $V$ is monotone and $V([d]) > V(S)$, there has to be a number $j \in [k]$ such that $V(S \cup \{i_1, \ldots, i_j\}) > V(S \cup \{i_1, \ldots, i_{j-1}\})$. But because of the approximate submodularity of $V$ it holds that $V(S \cup \{i_1, \ldots, i_j\}) - V(S \cup \{i_1, \ldots, i_{j-1}\}) \leq C \cdot (V(S \cup \{i_j\}) - V(S))$, which implies that $V(S \cup \{i_j\}) > V(S)$, and this contradicts our assumption that $\mathcal{R}(S) = \emptyset$.
Theorem A.3. Consider the non-parametric regression model $y = m(x) + \varepsilon$, where $m : \{0,1\}^d \to [-1/2, 1/2]$ is an $r$-sparse function and $\varepsilon \sim \mathcal{E}$, with $\varepsilon \in [-1/2, 1/2]$ and $\mathbb{E}[\varepsilon] = 0$. Let $\hat{m}$ be the function that Algorithm 3 returns with input $t \geq (1/\eta)^{C \cdot r}$; then under Assumption 3.1 it holds that $\mathbb{E}_x[(\hat{m}(x) - m(x))^2] \leq \eta$. Moreover, under the Independence of Features Assumption 2.2, if $\log(t) \geq r$ then $\hat{m} = m$.

Proof. Let $R \subseteq [d]$ be the set, of size $|R| = r$, of the relevant features of the target function $m$. Let $S$ be the set of splits that Algorithm 3 chooses. Observe that it holds that $L(S \cup R) = 0$, i.e. $V(S \cup R) \triangleq V^*$ is maximized. Since $m(x) \in [-1/2, 1/2]$, both $L$ and $V$ are bounded by $1$.

For the first part of the theorem, let $\{i_1, \ldots, i_r\}$ be an arbitrary enumeration of $R$ and let $R_j = \{i_1, \ldots, i_j\}$; then by adding and subtracting terms of the form $V(S \cup R_j)$ we have the following equality:
$$\left(V(S \cup R) - V(S \cup R_{r-1})\right) + \cdots + \left(V(S \cup R_2) - V(S \cup \{i_1\})\right) + V(S \cup \{i_1\}) = V^*.$$
From the approximate submodularity of $V$ we hence have that
$$\left(V(S \cup \{i_r\}) - V(S)\right) + \cdots + \left(V(S \cup \{i_2\}) - V(S)\right) + \left(V(S \cup \{i_1\}) - V(S)\right) \geq \frac{V^* - V(S)}{C},$$
which implies
$$\max_{j \in [r]} \left(V(S \cup \{i_j\}) - V(S)\right) \geq \frac{V^* - V(S)}{C \cdot r}.$$
Let $i_{\text{level}}$ be the coordinate that the algorithm chose to split at level level. From the greedy criterion that Algorithm 3 uses, we get that the coordinate $i_{\text{level}}$ that we picked to split on was at least as good as the best of the coordinates in $R$; hence it holds that
$$V\left(S \cup \{i_{\text{level}}\}\right) \geq V(S) + \frac{V^* - V(S)}{C \cdot r},$$
which in turn, using $L^* \triangleq L(S \cup R) = 0$, gives
$$L\left(S \cup \{i_{\text{level}}\}\right) \leq L(S)\left(1 - \frac{1}{C \cdot r}\right). \quad (A.1)$$
Fixing $S_{\text{level}}$ to be the set of splits after step level of Algorithm 3, it holds that
$$L(S_{\text{level}+1}) \leq L(S_{\text{level}})\left(1 - \frac{1}{C \cdot r}\right).$$
Inductively, and using the fact that $m(x) \in [-1/2, 1/2]$, this implies that
$$L(S_{\text{level}}) \leq L(\emptyset)\left(1 - \frac{1}{C \cdot r}\right)^{\text{level}} \leq \left(1 - \frac{1}{C \cdot r}\right)^{\text{level}}. \quad (A.2)$$
Finally, from the choice of $t$ we have that for level $= C \cdot r \ln(1/\eta)$ it holds that $L(S_{\text{level}}) \leq \eta$, and since $L(S)$ is a decreasing function of $S$, the first part of the theorem follows.

For the second part, we observe that for any coordinate $i \in [d] \setminus R$ and for any $S \subseteq [d]$ it holds that $V(S \cup \{i\}) - V(S) = 0$. Hence the algorithm picks coordinates in $[d] \setminus R$ only after it picks all the coordinates in $R$. Hence for $\log(t) \geq r$ we have that $\mathcal{R}(S) = \emptyset$, and from Lemma A.2 the second part of the theorem follows.

A.2 Population Algorithm of Breiman's Algorithm

In this section we present the analysis of Breiman's algorithm in the population model, defined in Algorithm 4.
Algorithm 4: Breiman's Tree Construction Algorithm – Population Model

Input: maximum number of nodes $t$.
Output: tree approximation of $m$.
Set $P_0 = \{\{0,1\}^d\}$, the partition associated with the root of the tree.
level $\leftarrow 0$; n_nodes $\leftarrow 1$; queue $\leftarrow P_0$.
while n_nodes $< t$ do
  if queue $= \emptyset$ then level $\leftarrow$ level $+ 1$; queue $\leftarrow P_{\text{level}}$ end
  Pick $A$, the first element of queue.
  Select $i \in [d]$ that maximizes $V^\ell(A, i)$ [see (4.4)].
  Cut the cell $A$ into cells $A_{ik} = \{x \mid x \in A \wedge x_i = k\}$, $k = 0, 1$.
  queue $\leftarrow$ queue $\setminus \{A\}$; $P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup \{A_{i0}, A_{i1}\}$; n_nodes $\leftarrow$ n_nodes $+ 1$.
end
$P_{\text{level}+1} \leftarrow P_{\text{level}+1} \cup$ queue
return $(P, \hat{m}) = \left(P_{\text{level}+1},\ x \mapsto \mathbb{E}_{(z,y) \sim \mathcal{D}}[y \mid z \in P_{\text{level}+1}(x)]\right)$

We now prove some important properties of the functions $V$, $V^\ell$, $L$ and $L^\ell$, as presented in Equations (3.1), (3.2), (4.3) and (4.4).
Lemma A.4. For any partition $P$ and any cell $A \in P$ the following hold:

1. $V(P) = \sum_{A \in P} \mathbb{P}_x(x \in A) \cdot V^\ell(A)$,
2. $L(P) = \sum_{A \in P} \mathbb{P}_x(x \in A) \cdot L^\ell(A)$,
3. $V(P, A, i) - V(P) = \mathbb{P}_x(x \in A) \cdot \left(V^\ell(A, i) - V^\ell(A)\right)$,
4. under Assumption 4.1, for any two partitions $P' \sqsubseteq P$ and any cells $A, A'$ such that $A' \subseteq A$ and $A' \in P'$, $A \in P$, it holds that $V(P', A', i) - V(P') \leq C \cdot (V(P, A, i) - V(P))$,
5. under Assumption 4.1, for any two partitions $P' \sqsubseteq P$, any cells $A, A'$ such that $A' \subseteq A$ and $A' \in P'$, $A \in P$, and for any $T \subseteq [d]$, $i \in [d]$, it holds that $V(P', A', T \cup \{i\}) - V(P', A', T) \leq C \cdot (V(P, A, i) - V(P))$.

Proof. Equations (1.) and (2.) follow from the definitions of $V$, $V^\ell$, $L$, $L^\ell$. For equation (3.) we have
$$V(P, A, i) - V(P) = \mathbb{P}_x\left(x \in A_{i0}\right) \cdot V^\ell(A_{i0}) + \mathbb{P}_x\left(x \in A_{i1}\right) \cdot V^\ell(A_{i1}) - \mathbb{P}_x(x \in A) \cdot V^\ell(A),$$
$$V^\ell(A, i) - V^\ell(A) = \mathbb{P}_x\left(x \in A_{i0} \mid x \in A\right) \cdot V^\ell(A_{i0}) + \mathbb{P}_x\left(x \in A_{i1} \mid x \in A\right) \cdot V^\ell(A_{i1}) - V^\ell(A),$$
and therefore we have that $V(P, A, i) - V(P) = \mathbb{P}_x(x \in A) \cdot \left(V^\ell(A, i) - V^\ell(A)\right)$, so equation (3.) follows. Now from Assumption 4.1 we have that
$$V^\ell(A', i) - V^\ell(A') \leq C \cdot \left(V^\ell(A, i) - V^\ell(A)\right) \implies \mathbb{P}_x(x \in A') \cdot \left(V^\ell(A', i) - V^\ell(A')\right) \leq \mathbb{P}_x(x \in A') \cdot C \cdot \left(V^\ell(A, i) - V^\ell(A)\right),$$
but since $A' \subseteq A$ implies $\mathbb{P}_x(x \in A') \leq \mathbb{P}_x(x \in A)$, we hence have
$$\mathbb{P}_x(x \in A') \cdot \left(V^\ell(A', i) - V^\ell(A')\right) \leq \mathbb{P}_x(x \in A) \cdot C \cdot \left(V^\ell(A, i) - V^\ell(A)\right).$$
Combining the last inequality with equation (3.) we get a proof of equation (4.). The statement in (5.) can be proven in an identical manner to (4.).
Definition A.5. Given a cell $A$, we define the set $\mathcal{R}(A) = \{i \in [d] \mid V^\ell(A, i) > V^\ell(A)\}$. We also define the set $I(A) = \{i \in [d] \mid A_{i0} \subsetneq A\}$ and $O(A) = [d] \setminus I(A)$.
Lemma A.6. For every partition $P$, under Assumption 4.1, if for every $A \in P$ it holds that $\mathcal{R}(A) = \emptyset$, then for any $B \in P$ with $\mathbb{P}_x(x \in B) > 0$ and any $x, x' \in B$ it holds that $m(x) = m(x')$.

Proof. We prove this by contradiction. Let $B \in P$; if there exist $x, x' \in B$ such that $m(x) \neq m(x')$, then obviously there exists a $\tilde{x} \in B$ such that $m(\tilde{x}) \neq \mathbb{E}_x[m(x) \mid x \in B]$. Therefore it holds that $L(P) > L(P, B, [d])$ and hence $V(P, B, [d]) > V(P)$. Now consider an arbitrary enumeration $\{i_1, \ldots, i_k\}$ of the set $I(B)$. Because the function $V$ is decreasing with respect to $P$ and $V(P, B, [d]) > V(P)$, there has to be a number $j \in [k]$ such that $V(P, B, \{i_1, \ldots, i_j\}) > V(P, B, \{i_1, \ldots, i_{j-1}\})$. But because of Assumption 4.1 on $V^\ell$ and Lemma A.4 it holds that
$$0 < V(P, B, \{i_1, \ldots, i_j\}) - V(P, B, \{i_1, \ldots, i_{j-1}\}) \leq C \cdot \left(V(P, B, i_j) - V(P)\right),$$
so by Lemma A.4 we have that $V^\ell(B, i_j) > V^\ell(B)$, which together with $\mathbb{P}_{x \sim \mathcal{D}_x}(x \in B) > 0$ contradicts the assumption that $\mathcal{R}(B) = \emptyset$.
Theorem A.7. Consider the non-parametric regression model $y = m(x) + \varepsilon$, where $m : \{0,1\}^d \to [-1/2, 1/2]$ is an $r$-sparse function and $\varepsilon \sim \mathcal{E}$, with $\varepsilon \in [-1/2, 1/2]$ and $\mathbb{E}[\varepsilon] = 0$. Let $\hat{m}$ be the function that Algorithm 4 returns with input $t \geq (1/\eta)^{C \cdot r}$; then under Assumption 4.1 it holds that $\mathbb{E}_x[(\hat{m}(x) - m(x))^2] \leq \eta$. Also, under the Independence of Features Assumption 2.2, if $t \geq 2^r$ then $\mathbb{P}_{x \sim \mathcal{D}_x}(\hat{m}(x) = m(x)) = 1$.

Proof. When the value of level changes, the algorithm considers separately every cell $A$ in $P_{\text{level}}$. For every such cell $A$ it holds that $L^\ell(A, R) = 0$, i.e. $V^\ell(A, R) \triangleq V^*(A)$ is maximized. Since $m(x) \in [-1/2, 1/2]$, the function $V^\ell$ is bounded by $1$. Now let $\{i_1, \ldots, i_r\}$ be an arbitrary enumeration of $R$ and let $R_j = \{i_1, \ldots, i_j\}$; then by adding and subtracting terms of the form $V^\ell(A, R_j)$ we have the following equality:
$$\left(V^\ell(A, R) - V^\ell(A, R_{r-1})\right) + \cdots + \left(V^\ell(A, R_2) - V^\ell(A, i_1)\right) + V^\ell(A, i_1) = V^*(A).$$
From Assumption 4.1 we have that
$$\left(V^\ell(A, i_r) - V^\ell(A)\right) + \cdots + \left(V^\ell(A, i_2) - V^\ell(A)\right) + \left(V^\ell(A, i_1) - V^\ell(A)\right) \geq \frac{V^*(A) - V^\ell(A)}{C},$$
which implies
$$\max_{j \in [r]} \left(V^\ell(A, i_j) - V^\ell(A)\right) \geq \frac{V^*(A) - V^\ell(A)}{C \cdot r}.$$
Let $i^{\text{level}}_A$ be the coordinate that the algorithm chose to split cell $A$ at level level. From the greedy criterion that we use to pick the next coordinate to split in Algorithm 4, we get that the coordinate $i^{\text{level}}_A$ that we picked to split $A$ was at least as good as the best of the coordinates in $R$; hence it holds that
$$V^\ell\left(A, i^{\text{level}}_A\right) \geq V^\ell(A) + \frac{V^*(A) - V^\ell(A)}{C \cdot r},$$
which in turn, because $L^*(A) \triangleq L^\ell(A, R) = 0$, gives
$$L^\ell\left(A, i^{\text{level}}_A\right) \leq L^\ell(A)\left(1 - \frac{1}{C \cdot r}\right). \quad (A.3)$$
Again we fix $Q_{\text{level}}$ to be the partition $P_{\text{level}}$ at the moment that level changed, so that $P_{\text{level}}$ is a full partition of $\{0,1\}^d$. Then because of (A.3) and Lemma A.4 it holds that
$$L(Q_{\text{level}+1}) = \sum_{A \in Q_{\text{level}}} \mathbb{P}_x(x \in A)\, L^\ell\left(A, i^{\text{level}}_A\right) \leq \sum_{A \in Q_{\text{level}}} \mathbb{P}_x(x \in A)\, L^\ell(A)\left(1 - \frac{1}{C \cdot r}\right) \quad (A.4)$$
$$= L(Q_{\text{level}})\left(1 - \frac{1}{C \cdot r}\right). \quad (A.5)$$
Inductively, and using the fact that $m(x) \in [-1/2, 1/2]$, this implies that
$$L(Q_{\text{level}}) \leq L(P_0)\left(1 - \frac{1}{C \cdot r}\right)^{\text{level}} \leq \left(1 - \frac{1}{C \cdot r}\right)^{\text{level}}. \quad (A.6)$$
Finally, from the choice of $t$ we have that level $\geq C \cdot r \cdot \ln(1/\eta)$, and hence $L(Q_{\text{level}}) \leq \eta$, so the first part of the theorem follows.

For the second part, we observe that for any coordinate $i \in [d] \setminus R$ and for any cell $A$ it holds that $V^\ell(A, i) - V^\ell(A) = 0$. Hence the algorithm picks coordinates in $[d] \setminus R$ only after it picks all the coordinates in $R$. Hence for $t \geq 2^r$ we have that $\mathcal{R}(A) = \emptyset$ for all the cells $A$ in the output partition, and from Lemma A.6 the second part of the theorem follows.

B Bias-Variance Decomposition of Shallow Trees
In this section, we prove a bias-variance decomposition for estimators defined via partitions of the input space, a special case of which are tree-based estimators. Moreover, we prove a bound on the variance via an adaptation of the localized Rademacher complexity analysis, to account for partition-based estimators (which are not necessarily global minimizers of the empirical risk).
Definition B.1. Given a partition $P = \{A_1, \ldots, A_k\}$ of $\{0,1\}^d$, we define the set $\mathcal{F}(P)$ of piecewise constant functions that take a constant value on every set of $P$, i.e.
$$\mathcal{F}(P) = \{m : \{0,1\}^d \to [-1, 1] \mid \forall A \in P, \forall x, x' \in A,\ m(x) = m(x')\}.$$
If $Z = \{P_1, \ldots, P_s\}$ is a family of partitions of $\{0,1\}^d$, then we define $\mathcal{F}(Z)$ to be the union of $\mathcal{F}(P)$ over all $P \in Z$.

For any function class $\mathcal{G}$, we define the critical radius as any solution $\delta$ to the inequality:
$$\mathcal{R}(\delta; \mathcal{G}) \leq \delta^2, \quad (B.1)$$
where $\mathcal{R}(\delta; \mathcal{G})$ is the localized Rademacher complexity, defined as:
$$\mathcal{R}(\delta; \mathcal{G}) = \mathbb{E}_{D_n \sim \mathcal{D}^n,\, \varepsilon \sim \mathrm{Rad}^n}\left[\sup_{g \in \mathcal{G} : \|g\|_2 \leq \delta} \left|\frac{1}{n} \sum_i \varepsilon_i\, g(x_i, y_i)\right|\right], \quad (B.2)$$
where the $\varepsilon_i$ are independent Rademacher random variables taking values equiprobably in $\{-1, 1\}$. Moreover, we define the star-hull of the function class as $\mathrm{star}(\mathcal{G}) = \{\kappa\, g : g \in \mathcal{G}, \kappa \in [0, 1]\}$.

Lemma B.2 (Bias-Variance Decomposition). Consider a mapping $P(D_n)$ (for simplicity $P_n$) that maps a set of training samples into a partition of the space $\{0,1\}^d$. Let $\mathcal{P}$ be the image of this mapping, i.e. the union of $P_n$ over all possible $D_n$. Suppose that an estimator $\hat{m}$ minimizes the empirical mean squared error among all piece-wise constant functions $f \in \mathcal{F}(P_n)$, i.e.:
$$\hat{m} = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}(P_n)} \sum_{i \in [n]} \left(y^{(i)} - f(x^{(i)})\right)^2. \quad (B.3)$$
Let $\mathcal{F} = \mathcal{F}(\mathcal{P})$ and let $\delta_n \geq \Theta\left(\sqrt{\frac{\log(\log(n))}{n}}\right)$ be an upper bound on the critical radius of $\mathrm{star}(\mathcal{F} - m)$. Moreover, let $\tilde{m}_n(\cdot) = \mathbb{E}_{z \sim \mathcal{D}_x}[m(z) \mid z \in P_n(\cdot)]$. Then for a universal constant $C$, w.p. $1 - \zeta$:
$$\mathbb{E}_{x \sim \mathcal{D}_x}\left[(\hat{m}(x) - m(x))^2\right] \leq C\left(\delta_n + \sqrt{\frac{\log(1/\zeta)}{n}}\right)^2 + 3\,\mathbb{E}_{x \sim \mathcal{D}_x}\left[(\tilde{m}_n(x) - m(x))^2\right]. \quad (B.4)$$

B.1 Proof of Lemma B.2
Notation. To simplify the exposition, we introduce here some notation that we need for our local Rademacher complexity analysis. We define $c(x, y; m)$ to represent the error of the sample $(x, y)$ according to the function $m$. In our setting we have that $c(x, y; m) = (y - m(x))^2$, and we may drop the argument $m$ from $c$ when $m$ is clear from the context. We define $\mathcal{D}_X$ to be the marginal with respect to $x$ of the distribution $\mathcal{D}$. Also we use the notation $\|m\|_2^2 = \mathbb{E}_{x \sim \mathcal{D}_X}[m(x)^2]$. In this section we sometimes use $\hat{m}$ in place of $m_n$, but they have the same meaning. Finally, let $\mathbb{E}_n$ denote the empirical expectation with respect to $D_n$.

For simplicity of notation, let $\mathcal{F}(D_n) = \mathcal{F}_n \triangleq \mathcal{F}(P_n)$ and $\mathcal{F} \triangleq \mathcal{F}(\mathcal{P})$. From the definition of $\hat{m}$ we have that
$$\sum_{i \in [n]} \left(y^{(i)} - \hat{m}(x^{(i)})\right)^2 \leq \inf_{f \in \mathcal{F}_n} \sum_{i \in [n]} \left(y^{(i)} - f(x^{(i)})\right)^2. \quad (B.5)$$
Now for any function $g : \{0,1\}^d \to \mathbb{R}$ we have that
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[(y - g(x))^2 - (y - m(x))^2\right] = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[g(x)^2 - m(x)^2 - 2y(g(x) - m(x))\right] = \mathbb{E}_{x \sim \mathcal{D}_X}\left[g(x)^2 - m(x)^2 - 2\,\mathbb{E}[y \mid x](g(x) - m(x))\right] = \mathbb{E}_{x \sim \mathcal{D}_X}\left[g(x)^2 + m(x)^2 - 2m(x)g(x)\right] = \|g - m\|_2^2. \quad (B.6)$$
If we plug $g = \hat{m}$ into (B.6) then we get that
$$\|\hat{m} - m\|_2^2 = \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; m)]. \quad (B.7)$$
We also define the following function:
$$\tilde{m}_n = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}_n} \mathbb{E}_{x \sim \mathcal{D}_X}[(f(x) - m(x))^2]. \quad (B.8)$$
Observe that the solution to this optimization takes the form:
$$\tilde{m}_n = \mathbb{E}_{z \sim \mathcal{D}_x}[m(z) \mid z \in P_n(\cdot)]. \quad (B.9)$$
Conditional on the training set $D_n$, we have:
$$\|\hat{m} - m\|_2^2 = \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; m)] = \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)] + \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \tilde{m}_n) - c(x, y; m)] = \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)] + \|\tilde{m}_n - m\|_2^2 \quad \text{(by (B.6))}.$$
Now we can relate the population generalization error to the empirical one:
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)] = \mathbb{E}_n[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)] + \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)] - \mathbb{E}_n[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)].$$
Since by definition $\hat{m}$ minimizes the empirical loss over $\mathcal{F}_n$, and since $\tilde{m}_n \in \mathcal{F}_n$, the first term is non-positive and hence
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)] \leq \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)] - \mathbb{E}_n[c(x, y; \hat{m}) - c(x, y; \tilde{m}_n)]$$
$$= \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; m)] - \mathbb{E}_n[c(x, y; \hat{m}) - c(x, y; m)] + \mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; m) - c(x, y; \tilde{m}_n)] - \mathbb{E}_n[c(x, y; m) - c(x, y; \tilde{m}_n)].$$
Note that $\mathcal{F}_n$ is a subset of the space of functions $\mathcal{F}$, which is independent of $D_n$. Thus it suffices to prove a uniform convergence tail bound for all functions in the latter space.

By Lemma 7 of [FS19], we have that if $\delta_n \geq \Theta\left(\sqrt{\frac{\log(\log(n))}{n}}\right)$ is any solution to the inequality
$$\mathcal{R}(\delta; \mathrm{star}(\mathcal{F} - m)) \leq \delta^2, \quad (B.10)$$
then for some universal constant $C$, we have that with probability $1 - \delta$, for all $f \in \mathcal{F}$ it holds that
$$\left|\mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; f) - c(x, y; m)] - \frac{1}{n} \sum_{i=1}^{n} \left(c(x_i, y_i; f) - c(x_i, y_i; m)\right)\right| \leq C(\delta_n + \zeta)\|f - m\|_2 + C(\delta_n + \zeta)^2,$$
for $\zeta = \sqrt{\frac{\log(1/\delta)}{n}}$. Applying the same lemma to the loss function $-c$, we get
$$\left|\mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; m) - c(x, y; f)] - \frac{1}{n} \sum_{i=1}^{n} \left(c(x_i, y_i; m) - c(x_i, y_i; f)\right)\right| \leq C(\delta_n + \zeta)\|f - m\|_2 + C(\delta_n + \zeta)^2.$$
Applying the first inequality to $f = \hat{m}$ and the second to $f = \tilde{m}_n$, and taking a union bound over both events, we have that for $\zeta = \sqrt{\frac{\log(2/\delta)}{n}}$, w.p. $1 - \delta$:
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; \hat{m}) - c(x, y; m)] - \mathbb{E}_n[c(x, y; \hat{m}) - c(x, y; m)] \leq C(\delta_n + \zeta)\|\hat{m} - m\|_2 + C(\delta_n + \zeta)^2,$$
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[c(x, y; m) - c(x, y; \tilde{m}_n)] - \mathbb{E}_n[c(x, y; m) - c(x, y; \tilde{m}_n)] \leq C(\delta_n + \zeta)\|\tilde{m}_n - m\|_2 + C(\delta_n + \zeta)^2.$$
Combining all of these we have, w.p. $1 - \delta$ over the training set:
$$\|\hat{m} - m\|_2^2 \leq C(\delta_n + \zeta)\left(\|\hat{m} - m\|_2 + \|\tilde{m}_n - m\|_2\right) + 2C(\delta_n + \zeta)^2 + \|\tilde{m}_n - m\|_2^2$$
$$\leq C^2(\delta_n + \zeta)^2 + \frac{1}{2}\left(\|\hat{m} - m\|_2^2 + \|\tilde{m}_n - m\|_2^2\right) + 2C(\delta_n + \zeta)^2 + \|\tilde{m}_n - m\|_2^2,$$
where we used $ab \leq a^2 + \frac{1}{4}b^2$ with $a = C(\delta_n + \zeta)$ and $b = \|\hat{m} - m\|_2 + \|\tilde{m}_n - m\|_2$, together with $(u+v)^2 \leq 2(u^2 + v^2)$. Re-arranging the last inequality yields:
$$\|\hat{m} - m\|_2^2 \leq 2C(2 + C)(\delta_n + \zeta)^2 + 3\|\tilde{m}_n - m\|_2^2, \quad (B.11)$$
which proves the lemma.

B.2 Critical Radius of Shallow Trees
VC dimension of F We now show that when the partition P n , P ( D n ) is defined by a treewith t leafs, then the function class F is a VC-subgraph class. Let F ( ζ ) denote the subgraph of F at any level ζ (i.e. the space of binary functions F ( ζ ) , { x → { f ( x ) > ζ } : f ∈ F } . To showthat F is VC-subgraph with VC dimension v , we need to show that F ( ζ ) has VC dimension atmost v .Observe that the number of all possible observationally equivalent functions that the functionclass F ( ζ ) can output on n samples is at most ( nd ) t t . This follows by the following argument:the number of possible functions is equal to the number of possible partitions of the n samples23hat can be induced by a tree with t leafs, multiplied by the number of possible binary valueassignments at the leafs. The latter is 2 t . The former is at most ( nd ) t . On the other hand, the set of all binary functions on n points is 2 n . Thus for the function class F ( ζ ) to be able to shatter a set of n points, it must be that 2 n ≤ ( n d ) t . Equivalently: n ≤ t log ( d ) + t log ( n ) ⇒ m ≤ t log ( t ) + t log ( d ) = O ( t log ( d t )) (B.12)Thus we get that the function class F ( ζ ) has VC dimension at most v = O ( t log ( d t )) . Thus F isa VC-subgraph class of VC dimension v = O ( t log ( d t )) . Bounding the critical radius
We will use the fact that the critical radius of star (cid:0)
F − m (cid:1) is O ( δ n ) , where δ n is any solution to the inequality (see e.g. [Wai19]): Z δδ /8 s H ( ε , star ( F − m ) δ , n , z s ) n ≤ δ (B.13)where G δ , n = { g ∈ G : k g k n = q n ∑ i g ( z i ) ≤ δ } and H ( ε , G , z n ) is the logarithm of the size ofthe smallest ε -cover of G , with respect to the empirical ℓ norm k g k n on the samples z n .First observe that the star hull can only add at most a logarithmic extra factor to the metricentropy, by a simple discretization argument on the parameter δ , i.e.: H ( ε , star ( F − m ) δ , n , z n ) ≤ H ( ε /2, ( F − m ) δ , n , z n ) + log ( f ∈ ( F− m ) δ , n k f k n / ε ) ≤ H ( ε , ( F − m ) δ , n , z n ) + log ( δ / ε ) Moreover, observer that the metric entropy of ( F − m ) δ , n is at most the metric entropy of F − m , which is at most the metric entropy of F (since m is a fixed function). Thus it suffices tobound the metric entropy of F .Theorem 2.6.7 of [VW96] shows that for any VC-subgraph class G of VC dimension v andbounded in [ −
1, 1 ] we have: H ( ε , G , z n ) = O ( v ( + log ( ε ))) (B.14)This implies that the critical radius of star ( F − m ) is of the order of any solution to the inequality: Z δδ /8 r v ( + log ( ε ) + log ( δ / ε ) n ≤ δ The left hand side is of order δ q v ( + log ( δ )) n . Thus the critical radius needs to satisfy for someconstant D , that: δ ≥ D r v ( + log ( δ )) n (B.15) This can be shown by induction; Let S s , t be the number of possible partitions induced by a tree with t leafs. Then S s ,1 = t leafs, we need to take a tree with t − s samples belongs to along the dimension of one of the d features. Thus we have d s total choices, leading to S s , t = S s , t − d s = ( d s ) t δ = Θ r v ( + log ( n )) n ! (B.16)Thus the critical radius of star ( F − m ) is Θ (cid:18)q t log ( d t ) ( + log ( n )) n (cid:19) . Corollary B.3 (Critical Radius of Shallow Trees) . Let P ( D n ) , be a function that maps any set oftraining samples D n into a partition of {
0, 1 } d , defined in the form of a binary tree with t leafs. Then thecritical radius of star ( F − m ) , as defined in Lemma B.2 is Θ (cid:18)q t log ( d t ) ( + log ( n )) n (cid:19) . Bias-Variance Decomposition of Deep Honest Forests
We next need the notion of diameter of a cell A with respect to the value of m ( x ) . Definition C.1 (Value-Diameter of a Cell) . Given set B ⊆ {
0, 1 } d we define the subset B ⊆ B such that x ∈ B if and only if P z ∼D x ( z ∈ B ) >
0. The value-diameter ∆ ( B ) of B to be equal to ∆ m ( B ) = max x , y ∈ B ( m ( x ) − m ( y )) . For any partition P of {
0, 1 } d we define the value-diameterof the partition P to be ∆ m ( P ) = max A ∈P P x ∼D x ( x ∈ A ) · ∆ m ( A ) . Lemma C.2.
Consider any forest with B trees, where each tree is built with honesty and on a randomsub-sample of size s. Let ε ( s ) = E x ∼D x ,D s /2 ∼D s /2 [ ∆ m ( P s /2 ( x ))] . Then P D n ∼D n (cid:18) E x ∼D x [( m n , s ( x ) − m ( x )) ] ≤ O (cid:18) s log ( n / δ ) n + d log ( δ ) B (cid:19) + ε ( s ) (cid:19) ≥ − δ . Proof.
We start with defining the following function m s ( x ) = E D n ∼D n [ m n , s ( x )] = E D s ∼D s [ m s ( x )] . (C.1)For mean squared error of m n , s we have: E x ∼D x ,D n ∼D n [( m n , s ( x ) − m ( x )) ] = E x ∼D x ,D n ∼D n [( m n , s ( x ) − m s ( x )) ] + E x ∼D x [( m s ( x ) − m ( x )) ] The first part we know that it is bounded for every x and with exponential tails due to concen-tration of U-statistics [Hoe94, PAR10]., i.e. for any fixes x with probability 1 − δ it holds that ( m n , s ( x ) − m s ( x )) ≤ O (cid:18) s log ( δ ) n (cid:19) (C.2)Thus integrating over x ∼ D x , we have: P x ∼D x ,D n ∼D n (cid:18) ( m n , s ( x ) − m s ( x )) ≥ O (cid:18) s log ( δ ) n (cid:19)(cid:19) ≤ δ . (C.3)Let T ( x ) = ( m n , s ( x ) − m s ( x )) and ε = Θ (cid:16) s log ( δ ) n (cid:17) . Suppose that with probability more than n δ over the training set D n ∼ D n , we had that P x ∼D x ( T ( x ) ≥ ε | D n ) ≥ n . Then we have that P x ∼D x ,D n ∼D n ( T ( x ) ≥ ε ) ≥ δ , which contradict C.3. Thus we know that with probability 1 − n δ over D n ∼ D n it holds that P x ∼D x ( T ( x ) ≥ ε | D n ) ≤ n . Hence with probability 1 − n δ over thetraining set D n ∼ D n it holds that E x ∼D x [( m n , s ( x ) − m s ( x )) ] ≤ ε + P x ∼D x ( T ( x ) ≥ ε ) ≤ ε + n = O (cid:18) s log ( δ ) n (cid:19) + n Setting δ ′ = n δ , we have the following P D n ∼D n (cid:18) E x ∼D x [( m n , s ( x ) − m s ( x )) ] ≤ O (cid:18) s log ( n / δ ′ ) n (cid:19)(cid:19) ≥ − δ ′ . (C.4)26or the bias term we define for simplicity w ( j ) ( x ) = { x ∈P n ( x ( j ) ) } N n ( P n ( x ( j ) )) and hence m s ( x ) = ∑ si = w ( j ) ( x ) y ( j ) and we have: E x ∼D x [( m s ( x ) − m ( x )) ] = E x ∼D x "(cid:18) E D n ∼D n [ m n , s ( x )] − m ( x ) (cid:19) = E x ∼D x E D s ∼D s " s ∑ j = w ( j ) ( x ) ( y ( j ) − m ( x ( j ) )) + E D n ∼D n " s ∑ j = w ( j ) ( x ) ( m ( x ( j ) ) − m ( x )) Due to honesty w ( j ) ( x ) is independent of y ( j ) and we have that the first term is equal to 0 by atower law. Thus we have: E x ∼D x [( m s ( x ) − m ( x )) ] = E x ∼D x E D n ∼D n " s ∑ j = w ( j ) ( x ) ( m ( x ( j ) ) − m ( x )) ≤ E x ∼D x ,D n ∼D n s ∑ j = w ( j ) ( x )( m ( x ( j ) ) − m ( x )) ! ≤ E x ∼D x ,D s /2 ∼D s /2 [ ∆ m ( P s /2 ( x ))] Proofs for Level Splits Algorithms
In this section we present the proofs of Theorem 3.3 and Theorem 3.4. We start with a proof aboutthe bias of the trees that are produced by Algorithm 1 and then we show how we can bound thevariance term. First, we define the set K ( S ; D n ) , or for simplicity K n ( S ) , as the partition of thesamples induced by the set of splits S , i.e. K n ( S ) = (cid:8) K n ( S , z ) | z ∈ {
0, 1 } d (cid:9) where we define K n ( S , z ) as the following set K n ( S , z ) = n j | x ( j ) T n ( S , z ) = z T n ( S , z ) , j ∈ [ n ] o . Observe that K n is thesame as the partition of the samples implied by the partition P n of the space {
0, 1 } d , returned byAlgorithm 1. D.1 Bounding The Bias
We first prove a technical lemma for the concentration of the function V n around the function V .Observe that V is not the expected value of V n and hence this concentration bound is not trivial. Lemma D.1.
Assuming that d > , q ∈ [ d ] and k > , we have that P D n ∼D n sup S ⊆ [ d ] , | S |≤ q (cid:12)(cid:12) V n ( S ) − V ( S ) (cid:12)(cid:12) ≥ r q · ( q log ( d · q ) + t ) n ! ≤ exp ( − t ) . Proof.
For the purpose of the proof we will define the following function that interpolates be-tween then sample based function V n and the population based function V . J n ( S ) , ∑ K ∈K n ( S ) | K | n (cid:18) E ( x , y ) ∼D h y | x S = x ( K ) S i(cid:19) (D.1)First we bound the difference | V n ( S ) − J n ( S ) | in the following claim. Claim D.2.
Assuming that d > , r ∈ [ d ] and t > , we have that P D n ∼D n sup S ⊆ [ d ] , | S |≤ r | V n ( S ) − J n ( S ) | ≥ r r · ( r log ( d · r ) + t ) n ! ≤ exp ( − t ) . Proof.
For the first part of the proof, we fix a particular set of splits S . Using the fact that both y ( j ) , m ( · ) take values in [
0, 1 ] we get that | V n ( S ) − J n ( S ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ K ∈K n ( S ) | K | n ∑ j ∈ K | K | y ( j ) ! − E ( x , y ) h y | x S = x ( K ) S i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ∑ K ∈K n ( S ) | K | n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ j ∈ K | K | y ( j ) ! − E ( x , y ) h y | x S = x ( K ) S i(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) .Now let Y S ( x S ) be the distribution of the random variable y conditional that the randomvariable x takes value x S at the subset S of the coordinates. Observe that conditional on x ( j ) S , thevariables y ( j ) for j ∈ K ∈ K n ( S ) are i.i.d. samples from the distribution Y S ( x ( K ) S ) . Hence, usingthe Hoeffding’s inequality we have that for any K ∈ K n ( S ) it holds that P y ( j ) ∼Y S ( x ( K ) S ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ j ∈ K | K | y ( j ) − E ( x , y ) h y | x S = x ( K ) S i(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ s t | K | ! ≤ exp ( − t ) ,28hich, by a union bound over K n ( S ) , implies that P y ( j ) ∼Y S ( x ( j ) S ) _ K ∈K n ( S ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ j ∈ K | K | y ( j ) − E ( x , y ) h y | x S = x ( K ) S i(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ s ( | S | + t ) | K | ! ≤ exp ( − t ) ,where we have used the fact that after splitting on | S | coordinates we can create at most 2 | S | leafnodes, i.e. |K n ( S ) | ≤ | S | . Hence we have that P y ( j ) ∼Y S ( x ( j ) S ) | V n ( S ) − J n ( S ) | ≥ q ( | S | + t ) ∑ K ∈K n ( S ) p | K | n ! ≤ exp ( − t ) . (D.2)But we know that ∑ K ∈K n ( S ) | K | = n , and also we have that the for any vector w ∈ R k it holds that k w k ≤ √ k k w k . Therefore if we define the vector w = ( p | K | ) K ∈K n ( S ) we have that ∑ K ∈K n ( S ) p | K | n = k w k k w k ≤ p |K S |k w k ≤ s | S | n .Now using this in the inequality (D.2) and taking the expectation over x ( j ) S for all j we get thefollowing inequality for any S ⊆ [ d ] . P D n ∼D n | V n ( S ) − J n ( S ) | ≥ s ( | S | + t ) | S | n ≤ exp ( − t ) . (D.3)To finalize the proof, using a union bound over all S ⊆ [ d ] with | S | ≤ r we get that P D n ∼D n sup S ⊆ [ d ] , | S | = r | V n ( S ) − J n ( S ) | ≥ r ( r + t ) r n ! ≤ r ∑ i = (cid:18) di (cid:19)! exp ( − t ) . (D.4)Finally using the fact that log (cid:16) ∑ ri = ( di ) (cid:17) ≤ ( r + ) log ( d · r ) and assuming that d >
1, we havethat r + ( r + ) log ( d r ) ≤ r log ( dr ) and the claim follows.Next we bound the difference (cid:12)(cid:12) J n ( S ) − V ( S ) (cid:12)(cid:12) . Claim D.3.
If we assume that d > , r ∈ [ d ] , t > then we have that P D n ∼D n sup S ⊆ [ d ] , | S |≤ r (cid:12)(cid:12) J n ( S ) − V ( S ) (cid:12)(cid:12) ≥ r r · log ( d · r ) + tn ! ≤ exp ( − t ) . Proof.
For the first part of the proof, we fix a particular set of splits S . We then have that (cid:18) E ( x , y ) ∼D [ y | x S = z S ] (cid:19) = (cid:18) E x ∼D x [ m ( x ) | x S = z S ] (cid:19) , M S ( z S ) ,and hence J n ( S ) − V ( S ) = ∑ K ∈K n ( S ) | K | n (cid:16) E h m ( x ) | x S = x ( K ) S i(cid:17) − E x S (cid:20)(cid:16) E x [ m ( x ) | x S ] (cid:17) (cid:21) = n ∑ j ∈ [ n ] M S ( x ( j ) S ) − E x S [ M S ( x S )] .29ow since m ( · ) ∈ (cid:2) − , (cid:3) , we have that for any x ∈ {
0, 1 } d it holds that | M S ( x S ) | ≤ P D n ∼D n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n ∑ j ∈ [ n ] M S ( x ( j ) S ) − E x S [ M S ( x S )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ r t n ≤ exp ( − t ) .Finally if we apply the union bound over all sets S ⊆ [ d ] , with | S | = r , the claim follows.If we combine Claim D.2 and D.3, the Lemma D.1 follows.Towards bounding the bias term we provide a relaxed version of the Definition A.1. Definition D.4.
Given a set S , a positive number η and a training set D n , we define the sets R η n ( S ) = { i ∈ [ d ] | V n ( S ∪ { i } ) − V n ( S ) > η } and R η ( S ) = { i ∈ [ d ] | V ( S ∪ { i } ) − V ( S ) > η } . Forsimplicity for η = R ( S ) and R n ( S ) .Since V n is a monotone increasing function we have that V n ( S ∪ i ) ≥ V n ( S ) . Hence given S theAlgorithm 1 chooses the direction i that maximizes the positive quantity V n ( S ∪ i ) − V n ( S ) . Sothe bad event is that for all j ∈ [ d ] , V n ( S ∪ i ) − V n ( S ) > V n ( S ∪ j ) − V n ( S ) but V ( S ∪ i ) − V ( S ) = k ∈ [ d ] such that V ( S ∪ k ) − V ( S ) >
0. A relaxed version of this bad event canbe described using the Definition D.4. In this language the bad event is that the index i ∈ [ d ] thatthe Algorithm 1 chooses to split does not belong to R η ( S ) although R η ( S ) = ∅ . We bound theprobability of this event in the next lemma. Lemma D.5.
Let η = q r · ( r log ( d · r )+ t ) n and assume that d > , r ∈ [ d ] and t > , then it holds that P D n ∼D n _ S ⊆ [ d ] , | S |≤ r argmax i ∈ [ d ] V n ( S ∪ i ) !
6∈ R ( S ) | R η ( S ) = ∅ ! ≤ ( − t ) Proof.
Directly applying Lemma D.1 to Definition D.4 we have that P D n ∼D n _ i , S (cid:0) i ∈ R η n ( S ) | i
6∈ R ( S ) (cid:1)! ≤ exp ( − t ) , (D.5)where η = q r · ( r log ( d · r )+ t ) n . Similarly we have that P D n ∼D n _ i , S (cid:0) i
6∈ R η n ( S ) | i ∈ R η ( S ) (cid:1)! ≤ exp ( − t ) . (D.6)If we combine the above inequalities we get that there is a very small probability that there existsan index i ∈ R η ( S ) but an index j
6∈ R ( S ) is chosen instead. This is summarized in the followinginequality P D n ∼D n _ S argmax i ∈ [ d ] V n ( S ∪ i ) !
6∈ R ( S ) | R η ( S ) = ∅ !! ≤ ( − t ) (D.7)and the lemma follows. 30 emma D.6. For every set S ⊆ [ d ] , under Assumption 3.1, if R η ( S ) = ∅ , then E ( x , y ) ∼D "(cid:18) m ( x ) − E ( x , y ) ∼D [ m ( x ) | x S ] (cid:19) ≤ C · η · |R ( S ) | . Proof.
We know that L ([ d ]) = L ( · ) is approximatesupermodular. We first prove that L ( S ∪ R ( S )) =
0. If this is not the case then there existsan i
6∈ R ( S ) such that L ( S ∪ R ( S ) ∪ i ) − L ( S ∪ R ( S )) <
0. But because of the approximatesupermodularity of L we have that L ( S ∪ i ) − L ( S ) < C · ( L ( S ∪ R ( S ) ∪ i ) − L ( S ∪ R ( S ))) < i
6∈ R ( S ) .Now assume that R η ( S ) = ∅ and for the sake of contradiction also assume that L ( S ) > C · η · |R ( S ) | . Let { r , . . . , r k } be an arbitrary enumeration of the set R ( S ) . From the argumentbefore we have that L ( S ∪ R ( S )) = r j of R ( S ) such that L ( S ∪ { r , . . . , r j − } ) − L ( S ∪ { r , . . . , r j } ) > C · η ,otherwise we would immediately have L ( S ) ≤ C · η · |R ( S ) | . But because of the approximatesupermodularity of L ( · ) we have that C · ( L ( S ) − L ( S ∪ r j )) ≥ L ( S ∪ { r , . . . , r j − } ) − L ( S ∪ { r , . . . , r j } ) > C · η ,but this last inequality implies r j ∈ R η ( S ) which contradicts with our assumption that R η ( S ) = ∅ . Hence L ( S ) ≤ C · η · |R ( S ) | and the lemma follows.Finally we need one more Lemma to handle the case where Assumption 3.2 holds. Lemma D.7.
Let m be a target function that is ( β , r ) -strongly sparse, with set of relevant features R, andsuppose n ≥ · r · ( r log ( d · r )+ t ) β , then it holds that P D n ∼D n _ S ⊆ [ d ] , | S |≤ r argmax i ∈ [ d ] V n ( S ∪ i ) ! R | ( R \ S ) = ∅ ! ≤ ( − t ) Proof.
Directly applying Lemma D.1 to Definition D.4 we have that P D n ∼D n _ i , S (cid:0) i ∈ R η n ( S ) | i
6∈ R ( S ) (cid:1)! ≤ exp ( − t ) , (D.8)where η = q r · ( r log ( d · r )+ t ) n . Similarly we have that P D n ∼D n _ i , S (cid:0) i
6∈ R η n ( S ) | i ∈ R η ( S ) (cid:1)! ≤ exp ( − t ) . (D.9)If we combine the above inequalities with the Assumption 3.2 the lemma follows.We are now ready to upper bound the bias of the Algorithm 1 under the Assumption 3.1.31 heorem D.8. Let D n be i.i.d. samples from the non-parametric regression model y = m ( x ) + ε , wherem ( x ) ∈ [ − ] , ε ∼ E , E ε ∼E [ ε ] = and ε ∈ [ − ] . Let also P n be the partition thatAlgorithm 1 returns. Then under the submodularity Assumption 3.1 the following statements hold.1. If m is r-sparse and we set log ( t ) ≥ C · rC · r + ( log ( n ) − log ( log ( d / δ ))) , then it holds that P D n ∼D n E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) > ˜ Ω C · r · C · r + r log ( d / δ ) n !! ≤ δ .
2. Under the independence of features Assumption 2.2 and assuming that m is r-sparse and if log ( t ) ≥ r, then it holds that P D n ∼D n E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) > ˜ Ω C · r r · log ( d / δ )) n !! ≤ δ . Proof.
We fix S to be the set of splits that Algorithm 1 chooses. Our goal in this lemma is to showthat with high probability the following quantity is small L ( P n ) = E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) . (D.10)We prove this in two steps. First we show that after the completion of the level level of thealgorithm the quantity L ( S level ) is small. Then we bound the difference (cid:12)(cid:12) L ( P level ) − L ( S level ) (cid:12)(cid:12) inClaim D.9. Finally we use the monotonicity of L to argue about the upper bound on P n .Let R ⊆ [ d ] be the set of size | R | = r of the relevant features of the target function m . Observethat it holds that L ( S ∪ R ) = V ( S ∪ R ) , V ∗ is maximized. Since m ( x ) ∈ [ − ] ,the maximum value of V is 1.For the first part of the theorem, let { i , . . . , i r } and be an arbitrary enumeration of R andlet R j = { i , . . . , i j } then by adding and subtracting terms of the form V ( S ∪ R j ) we have thefollowing equality (cid:0) V ( S ∪ R ) − V ( S ∪ R r − ) (cid:1) + · · · + (cid:0) V ( S ∪ R ) − V ( S ∪ { i } ) (cid:1) + V ( S ∪ { i } ) = V ∗ .From the approximate submodularity of V we hence have that (cid:0) V ( S ∪ { i r } ) − V ( S ) (cid:1) + · · · + (cid:0) V ( S ∪ { i } ) − V ( S ) (cid:1) + (cid:0) V ( S ∪ { i } ) − V ( S ) (cid:1) ≥ V ∗ − V ( S ) C which implies max j ∈ [ r ] (cid:0) V ( S ∪ { i j } ) − V ( S ) (cid:1) ≥ V ∗ − V ( S ) C · r .Let i level be the coordinate that the algorithm chose to split at level level. Now from the greedycriterion of Algorithm 1 we get that the coordinate i level that we picked to split was at leastas good as the best of the coordinates in R , hence using Lemma D.1 and if we define η = q q · ( q log ( d · q )+ log ( δ )) n , where q is the maximum depth of the tree for which we are applyingLemma D.1, we have that with probability at least 1 − δ it holds that V (cid:16) S ∪ { i level } (cid:17) ≥ V ( S ) + V ∗ − V ( S ) C · r − η L ∗ , L ( S ∪ R ) = L (cid:16) S ∪ { i level } (cid:17) ≤ L ( S ) (cid:18) − C · r (cid:19) + η . (D.11)Let S level to be the set of splits after the step level of Algorithm 1. Then it holds that L ( S level + ) ≤ L ( S level ) (cid:18) − C · r (cid:19) + η .Inductively and using the fact that m ( x ) ∈ [ − ] , the latter implies that L ( S level ) ≤ L ( ∅ ) (cid:18) − C · r (cid:19) level + · η ≤ (cid:18) − C · r (cid:19) level + · η . (D.12)From the choice of t we have that for level = C · r ln ( η ) it holds L ( S level ) ≤ · C · r ln ( η ) η .For this analysis to be consistent we have to make sure that the maximum depth q for which weare applying Lemma D.1 is at least the value required for level. Thus we need to find values for q , η such that q ≥ C · r ln ( η ) at the same time when η ≥ q q · ( q log ( d · q )+ log ( δ )) n . It is easy tosee that the smallest possible value for η is hence achieved for q = C · rC · r + ( log ( n ) − log ( log ( d / δ ))) and η = ˜ Θ (cid:18) C · r + q log ( d / δ ) n (cid:19) . Hence the inequality L ( S level ) ≤ · C · r ln ( η ) η which implies L ( S level ) ≤ ˜ O C · r · C · r + r log ( d / δ ) n ! . (D.13)For the second part of the theorem we use Lemma D.5 and we have that at every step eitherthe algorithm chooses to split with respect to a direction i ∈ R ( S ) or R η ( S ) = ∅ . Becauseof our assumption that m is r -sparse and because we assume that the features are distributedindependently we have that at any step |R ( S ) | ≤ r . 
Hence, when level = r it has to be that theset S level during the execution of the Algorithm 1 satisfies R η ( S ) = ∅ . Then using Lemma D.6we have that L ( S level ) ≤ C · η · r from which we get that L ( S level ) ≤ O C · r r · log ( d / δ )) n ! . (D.14)Next we need to compare L ( S level ) with L ( P level ) . Claim D.9 (Dealing with Empty Cells) . It holds that with probability − δ (cid:12)(cid:12) L ( P level ) − L ( S level ) (cid:12)(cid:12) ≤ · level level ln ( d level ) + ln ( δ ) n . Proof.
Fix any possible cell A after doing a full partition on the first q , level splits of thealgorithm. For simplicity of the exposition of this proof we define for every subset B of {
0, 1 } n the probability P B , P x ∼D x [ x ∈ B ] and the empirical probability ˆ P B , n ∑ ni = { x ( j ) ∈ B } . Usingthe Chernoff-Hoeffding bound we get that P D n ∼D n ˆ P A ≥ P A − r ( δ ) P A n ! ≥ − δ .33f the empirical probability ˆ P A is zero then we get the following P D n ∼D n (cid:18) P A ≤ ( δ ) n + { ˆ P A = } (cid:19) ≥ − δ .The number of possible cells from a tree of depth q is at most 2 q d q q q . Therefore, by union boundover all the possible cells A , we have that P D n ∼D n _ A (cid:18) P A ≤ q ln ( dq ) + ( δ ) n + { ˆ P A = } (cid:19)! ≥ − δ . (D.15)Next, we consider the difference L ( P level ) − L ( S level ) . Let P level S be the partition of space if wesplit along all the coordinates in S level . It is easy to see that P level S is a refinement of the partition P level . Hence in the difference L ( P level ) − L ( S level ) we only have terms of the form P x ∼D x ( x ∈ B ) E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ B ] (cid:19) | x ∈ B −− ℓ ∑ j = P x ∼D x (cid:0) x ∈ A j (cid:1) E x ∼D x "(cid:18) m ( x ) − E z ∼D x (cid:2) m ( z ) | z ∈ A j (cid:3)(cid:19) | x ∈ A j .Where B ∈ P level , A j ∈ P level S and B is the union of the cells A , . . . , A ℓ . In order for B to remainunsplit in P level it has to be that for all but one of A j ’s it holds that ˆ P A j =
0. We denote with E ( B ) the above difference and we observe that it is equal to the following ℓ ∑ j = P A j E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ B ] (cid:19) − (cid:18) m ( x ) − E z ∼D x (cid:2) m ( z ) | z ∈ A j (cid:3)(cid:19) | x ∈ A j .Without loss of generality we will assume that A is the only subcell of B that is not empty. Wedefine Q ( A ) the following quantity P A E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ B ] (cid:19) − (cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ A ] (cid:19) | x ∈ A .Since m ( x ) ∈ [ − ] , we get that E ( B ) ≤ Q ( A ) + ℓ ∑ j = P A j . (D.16)Next we also bound Q ( A ) by the measure of cells in B \ A . The intuition why Q ( A ) issmall is that, since the cells B \ A have small measure, then the conditional expectation of m ( z ) conditional on z ∈ B is very close to the conditional expectation of m ( z ) conditional on z ∈ A .More formally, since x is 2-Lipschitz for x ∈ [ −
1, 1 ] , m ( z ) ∈ [ − ] and A ⊆ B : Q ( A ) ≤ P A (cid:12)(cid:12)(cid:12)(cid:12) E z ∼D x [ m ( z ) | z ∈ B ] − E z ∼D x [ m ( z ) | z ∈ A ] (cid:12)(cid:12)(cid:12)(cid:12) = P A (cid:12)(cid:12)(cid:12)(cid:12) E z ∼D x [ m ( z ) | z ∈ B \ A ] − E z ∼D x [ m ( z ) | z ∈ A ] (cid:12)(cid:12)(cid:12)(cid:12) P ( z ∈ B \ A | z ∈ B ) ≤ P A P B · ( P B − P A ) ≤ ( P B − P A ) = ℓ ∑ j = P A j ! A j , with j ≥
2, have ˆ P A j =
0, this means that P A j ≤ q ln ( dq )+ ( δ ) n due to(D.15). Putting this together with (D.16) we get that E ( B ) ≤ ℓ ∑ j = P A j ≤ ( ℓ − ) q ln ( dq ) + ln ( δ ) n .Let ℓ B the number of subcells of B ∈ P level that are inside P level S . If we sum over all B ∈ P level weget that (cid:12)(cid:12) L ( P level ) − L ( S level ) (cid:12)(cid:12) ≤ ∑ B ∈P level ℓ B ! q ln ( dq ) + ln ( δ ) n but the sum ∑ B ∈P level ℓ B is less than the size of P level S which is 2 q and hence we get that (cid:12)(cid:12) L ( P level ) − L ( S level ) (cid:12)(cid:12) ≤ · q q ln ( dq ) + ln ( δ ) n .Using Claim D.9 and equations (D.13) and (D.14) we get the first two parts of the theorem byobserving that the error term in Claim D.9 is less that the error terms in (D.13) and (D.14).Recall the definition of the value-diameter of a cell C.1. We are now ready to upper boundthe bias under the strong sparsity Assumption 3.2. Theorem D.10.
Let D n be i.i.d. samples from the non-parametric regression model y = m ( x ) + ε , wherem ( x ) ∈ [ − ] , ε ∼ E , E ε ∼E [ ε ] = and ε ∈ [ − ] . Let also P n be the partition thatAlgorithm 1 returns. If m is ( β , r ) -strongly sparse as per Assumption 3.2 then the following statementshold for the bias of the output of Algorithm 1.1. If n ≥ ˜ Ω (cid:16) r ( log ( d / δ )) β (cid:17) and we set log ( t ) ≥ r, then it holds that P D n ∼D n E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) > ˜ Ω (cid:18) r · log ( d / δ )) n (cid:19)! ≤ δ .
2. If R is the set of relevant features and and for every w ∈ {
0, 1 } r it holds for the marginal probabilitythat P z ∼D x ( z R = w ) ( ζ /2 r ) and if n ≥ ˜ Ω (cid:16) r ( log ( d / δ )) β + r log ( δ ) ζ (cid:17) and we set log ( t ) ≥ r,then it holds that P D n ∼D n ( ∆ m ( P n ) = ) ≥ − δ .
3. Let R be the set of relevant features, x ∈ {
0, 1 } d such that P z ∼D x ( z R = x R ) ≥ ζ /2 r , and assumethat we run Algorithm 1 with input h = and log ( t ) ≥ r. If n ≥ ˜ Ω (cid:16) r ( log ( d / δ )) β + r log ( δ ) ζ (cid:17) ,then it holds that (cid:18) E D n ∼D n [ m n ( x )] − m ( x ) (cid:19) ≤ δ .35 roof. For the first part of the theorem we observe that Lemma D.7 implies that L ( S level ) = A after doing a full partition onthe first r splits of the Algorithm 1. For simplicity of the exposition of this proof we definefor every subset B of {
0, 1 } n the probability P B , P x ∼D x [ x ∈ B ] and the empirical probabilityˆ P B , n ∑ ni = { x ( j ) ∈ B } . Using the multiplicative form of the Chernoff bound we get that P D n ∼D n n ˆ P A ≥ − s ( δ ) nP A nP A ≥ − δ .Hence for n ≥ ( δ ) P A we have that P D n ∼D n n ∑ i = { x ( i ) ∈ A } ≥ ! ≥ − δ .Next we can apply a union bound over all possible cell A that split according to the R coordinatesand using our assumption that P A ≥ ζ r we get that for n ≥ r ( r + log ( δ ) ζ it holds that P D n ∼D n _ A n ∑ i = { x ( i ) ∈ A } ≥ !! ≥ − δ . (D.17)Now let S be the set of splits after r iterations of the Algorithm 1. Then the Lemma D.7 impliesthat S = R and L ( S ) =
0. Finally from (D.17) we also have that the partition P r after r iterationsof Algorithm 1 is the partition the full partition to all the cells of R and hence E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) ≤ E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z S = x S ] (cid:19) where the later is 0 with high probability because of Lemma D.7. This means that with probabilityat least 1 − δ it holds that ∑ A ∈P n P x ∼D x ( x ∈ A ) · E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ A ] (cid:19) | x ∈ A = A ∈ P n it holds that either P x ∼D x ( x ∈ A ) = E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ A ] (cid:19) | x ∈ A = P x ∼D x ( x ∈ A ) = ∆ m ( A ) = ∆ m ( P n ) = − δ .For the third part of the theorem we define for simplicity w ( j ) ( x ) = { x ( j ) T n ( S , x ) = x T n ( S , x ) } N n ( x ; T n ( S , x )) and hence36 n ( x ) = ∑ ni = w ( j ) ( x ) y ( j ) and we have: (cid:18) E D n ∼D n [ m n ( x )] − m ( x ) (cid:19) == E D n ∼D n " n ∑ j = w ( j ) ( x ) ( y ( j ) − m ( x ( j ) )) + E D n ∼D n " n ∑ j = w ( j ) ( x ) ( m ( x ( j ) ) − m ( x )) Due to honesty w ( j ) ( x ) is independent of y ( j ) and we have that the first term is equal to 0 by atower law. Thus we have: (cid:18) E D n ∼D n [ m n ( x )] − m ( x ) (cid:19) = E D n ∼D n " s ∑ j = w ( j ) ( x ) ( m ( x ( j ) ) − m ( x )) ≤ E D n ∼D n n ∑ j = w ( j ) ( x )( m ( x ( j ) ) − m ( x )) ! Let also A = { z | z R = x R } , then using the multiplicative form of the Chernoff Bound from theproof of the second part of the theorem above we get P D n ∼D n (cid:16) ∑ ni = { x ( i ) ∈ A } ≥ (cid:17) ≥ − δ .Therefore with probability 1 − δ the path of the tree that leads to x has split all the relevantcoordinates R and hence for all j such that w ( j ) ( x ) > x ( j ) R = x R which in turnimplies that m ( x ( j ) ) = m ( x ) . With the rest δ probability the square inside the expectation is atmost 1 since m ( · ) ∈ (cid:2) − , (cid:3) , hence we get (cid:18) E D n ∼D n [ m n ( x )] − m ( x ) (cid:19) ≤ δ . D.2 Proof of Theorem 3.3
Observe that the output estimate m n ( · ; S ) and partition P n of Algorithm 1, satisfies the conditionsof Lemma B.2. Moreover, by Corollary B.3, we have that the critical radius quantity δ n is of order Θ (cid:18)q t log ( d t )( + log ( n )) n (cid:19) , if we grow the tree at depth log ( t ) . Thus applying the bound presentedin (B.4) with the bound on δ n we have the following cases:1. from case 1 of Theorem D.8 we get case 1 of Theorem 3.3,2. from case 2 of Theorem D.8 we get case 2 of Theorem 3.3 and3. from case 1 of Theorem D.10 we get case 3 of Theorem 3.3. D.3 Proof of Theorem 3.4
From case 2. of Theorem D.10 and since the maximum possible value diameter is 1, we havethat if s ≥ ˜ Ω (cid:16) r ( log ( d / δ )) β + r log ( δ ) ζ (cid:17) then E D n ∼D n [ ∆ m ( P s )] ≤ δ which implies E x ∼D x [( m s ( x ) − m ( x )) ] ≤ δ . Putting this together with the Lemma C.2 we get that if s = ˜ Ω (cid:16) r ( log ( d / δ )) β + r log ( δ ) ζ (cid:17) then P D n ∼D n (cid:18) E x ∼D x [( m n , s ( x ) − m ( x )) ] ≥ ˜ Ω (cid:18) r ( log ( d / ( δ · δ ′ ))) n · β + r log ( ( δ · δ ′ )) n · ζ (cid:19) + δ (cid:19) ≤ δ ′ .37rom the above we get Theorem 3.4 by setting δ = ˜ Ω (cid:16) r ( log ( d / δ ′ )) n · β + r log ( δ ′ ) n · ζ (cid:17) . E Proofs for Breiman’s Algorithm
In this Section we present the proof of the Theorem 4.4 and the Theorem 4.5. We start with aproof about the bias of the trees that are produced by the Algorithm 2 and then we show howwe can combine this with a bias-variance decomposition and bounds on the variance part.
E.1 Bounding The Bias
We start with definitions of the empirical mean squared error for a given partition P and theempirical mean squared error of a leaf for a particular leaf A . For the derivations below, weremind the following definitions from Section 4: for a cell A , we define the set Z n ( A ) , as thesubset of the training set Z n ( A ) = { j | x ( j ) ∈ A } and we define the partition U n ( P ) of thetraining set D n as U n ( P ) = {Z n ( A ) | A ∈ P } . L n ( P ) , n ∑ j ∈ [ n ] (cid:16) y ( j ) − m n ( x ( j ) ; P ) (cid:17) (E.1) = n ∑ j ∈ [ n ] (cid:16) y ( j ) (cid:17) + n ∑ j ∈ [ n ] m n ( x ( j ) ; P ) − ∑ j ∈ [ n ] n y ( j ) m n ( x ( j ) ; P )= n ∑ j ∈ [ n ] (cid:16) y ( j ) (cid:17) + ∑ Z ∈Z n ( P ) | Z | n m n ( x ( Z ) ; P ) − ∑ Z ∈Z n ( P ) | Z | n ∑ j ∈ Z | Z | y ( j ) ! m n ( x ( Z ) ; P )= n ∑ j ∈ [ n ] (cid:16) y ( j ) (cid:17) − n ∑ j ∈ [ n ] m n ( x ( j ) ; P ) , n ∑ j ∈ [ n ] (cid:16) y ( j ) (cid:17) − V n ( P ) . (E.2) L ℓ n ( A ) , N n ( A ) ∑ j ∈Z n ( A ) (cid:16) y ( j ) − m n ( x ( j ) ; A ) (cid:17) (E.3) = N n ( A ) ∑ j ∈Z n ( A ) (cid:16) y ( j ) (cid:17) + N n ( A ) ∑ j ∈Z n ( A ) m n ( x ( j ) ; A ) − ∑ j ∈Z n ( A ) N n ( A ) y ( j ) · m n ( x ( j ) ; A )= N n ( A ) ∑ j ∈Z n ( A ) (cid:16) y ( j ) (cid:17) + m n ( x ( Z n ( A )) ; A ) − ∑ j ∈Z n ( A ) N n ( A ) y ( j ) m n ( x ( Z n ( A )) ; A )= N n ( A ) ∑ j ∈Z n ( A ) (cid:16) y ( j ) (cid:17) − m n ( x ( j ) ; A ) , N n ( A ) ∑ j ∈ Z n ( A ) (cid:16) y ( j ) (cid:17) − V ℓ n ( A ) , (E.4)We first prove a technical lemma for the concentration of the function V ℓ n around the function V ℓ . Observe that V ℓ is not the expected value of V ℓ n and hence this concentration bound is not atrivial one. 38 efinition E.1 ( Large Cells ) . Let A ( q , ζ ) be the set of ( q , ζ ) -large cells such that A ∈ A ( q , ζ ) ifand only if | A | ≥ d − q and P x ∼D x ( x ∈ A ) ≥ ζ /2 q . Lemma E.2.
If d > , r ∈ [ d ] , t > and n ≥ + q ζ ( q log ( d ) + t ) then we have that P D n ∼D n sup A ∈A ( q , ζ ) (cid:12)(cid:12)(cid:12) V ℓ n ( A , i ) − V ℓ ( A , i ) (cid:12)(cid:12)(cid:12) ≥ s + q ( q log ( d ) + t ) ζ · n ≤ exp ( − t ) . Proof.
For the purpose of the proof we will define the following function that interpolates be-tween the sample based function V ℓ n and the population based function V ℓ . J n ( A , i ) , ∑ z ∈{ } N n ( A iz ) N n ( A ) (cid:18) E ( x , y ) ∼D h y | x ∈ A iz i(cid:19) (E.5)First we bound the difference (cid:12)(cid:12) V ℓ n ( A , i ) − J n ( A , i ) (cid:12)(cid:12) in the following claim. Claim E.3.
If d > , q ∈ [ d ] , t > , and n ≥ q ζ ( q log d + t ) then we have that P D n ∼D n sup A ∈A ( q , ζ ) (cid:12)(cid:12)(cid:12) V ℓ n ( A , i ) − J n ( A , i ) (cid:12)(cid:12)(cid:12) ≥ s + q ( q log ( d ) + t ) ζ · n ≤ exp ( − t ) . Proof.
We start by fixing a specific cell A ∈ A ( q , ζ ) . This cell A is fixed before we observe thetraining samples D n . We have that (cid:12)(cid:12)(cid:12) V ℓ n ( A , i ) − J n ( A , i ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ z ∈{ } N n ( A iz ) N n ( A ) ∑ j ∈Z n ( A iz ) N n ( A iz ) y ( j ) − E ( x , y ) ∼D h y | x ∈ A iz i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Since m ( x ) ∈ (cid:2) − , (cid:3) and ε ∈ (cid:2) − , (cid:3) , we have that y ∈ [ −
1, 1 ] and hence ≤ ∑ z ∈{ } N n ( A iz ) N n ( A ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ j ∈Z n ( A iz ) N n ( A iz ) y ( j ) − E ( x , y ) ∼D h y | x ∈ A iz i(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) .Now let Y ( A ) be the distribution of the random variable y conditional on the fact that therandom variable x lies in the cell A . Observe that since A is cell fixed before observing thetraining set D n , the samples y ( j ) for j ∈ Z n ( A ) are i.i.d. samples from the distribution Y ( A ) conditional on the event that x ( j ) is in A . We define Q ( A , K ) to be the event that Z n ( A ) = K where K ⊆ [ d ] and we have the following using Hoeffding’s inequality. P D n ∼D n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ j ∈Z n ( A iz ) N n ( A iz ) y ( j ) − E ( x , y ) ∼D h y | x ∈ A iz i(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ s t + ln ( ) N n ( A iz ) | Q ( A iz , K z ) ≤ e − t ,Where z ∈ {
0, 1 } and K , K are two disjoint subsets of [ d ] . Observe that conditional on Q ( A , K i ) the number N n ( A iz ) is equal to (cid:12)(cid:12) K i (cid:12)(cid:12) and hence is not a random variable any more. Then fromunion bound we have that if we condition on Q ( A i , K ) ∩ Q ( A i , K ) then we have that P D n ∼D n _ z ∈{ } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∑ j ∈Z n ( A iz ) N n ( A iz ) y ( j ) − E ( x , y ) ∼D h y | x ∈ A iz i(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ s ( ( ) + t ) N n ( A iz ) ≤ e − t ,39here we dropped the condition on Q ( A i , K ) ∩ Q ( A i , K ) from the above notation for simplicityof exposition. Hence we have the following which holds again if we condition on the event Q ( A i , K ) ∩ Q ( A i , K ) . P D n ∼D n (cid:12)(cid:12)(cid:12) V ℓ n ( A , i ) − J n ( A , i ) (cid:12)(cid:12)(cid:12) ≥ p ( ( ) + t ) ∑ z ∈{ } p N n ( A iz ) N n ( A ) ! ≤ exp ( − t ) . (E.6)But we know that ∑ z ∈{ } (cid:12)(cid:12) N n ( A iz ) (cid:12)(cid:12) = N n ( A ) , and also we have that the for any vector w ∈ R k itholds that k w k ≤ √ k k w k . Therefore we have that ∑ z ∈{ } p N n ( A iz ) N n ( A ) ≤ s N n ( A ) .Now using this in inequality (E.6) and taking the expectation over D n conditional on the event R ( A , k ) that is equal to the event that N n ( A ) = k and by the law of total expectation we get thefollowing inequality for any possible cell A . P D n ∼D n (cid:12)(cid:12)(cid:12) V ℓ n ( A , i ) − J n ( A , i ) (cid:12)(cid:12)(cid:12) ≥ r ( ( ) + t ) k | R ( A , k ) ! ≤ exp ( − t ) . (E.7)Since A is a cell of size at least 2 d − q we have that there exists a set Q A ⊆ [ d ] with q A = | Q A | ≤ q and a vector w A ∈ {
0, 1 } q A such that x ∈ A ⇔ x Q A = w A . Therefore by the assumption that A ∈ A ( q , ζ ) we have that P x ( x ∈ A ) ≥ ζ q . Hence from classical Chernoff bound for binaryrandom variables we have that P D n ∼D n (cid:18) N n ( A ) ≤ n · ζ · q (cid:19) ≤ exp (cid:18) − n · ζ q (cid:19) (E.8)Then by combining (E.7) and E.8 in the Bayes rule we get that P D n ∼D n (cid:12)(cid:12)(cid:12) V ℓ n ( A , i ) − J n ( A , i ) (cid:12)(cid:12)(cid:12) ≥ s ( ( ) + t ) · q ζ · n ! ≤ exp ( − t ) + exp (cid:18) − n · ζ q (cid:19) . (E.9)It is also easy to see that the number of possible cells A with size at least 2 d − q is at most 2 q · (cid:16) ∑ qi = ( di ) (cid:17) and hence |A ( q , ζ ) | ≤ (cid:16) ∑ qi = ( di ) (cid:17) . Now using a union bound of (E.9) over all possiblecells A ∈ A ( q , ζ ) we get P D n ∼D n sup A ∈A ( q , ζ ) (cid:12)(cid:12)(cid:12) V ℓ n ( A , i ) − J n ( A , i ) (cid:12)(cid:12)(cid:12) ≥ s ( ( ) + t ) · q ζ · n ! ≤≤ r ∑ i = (cid:18) di (cid:19)! (cid:18) exp ( − t ) + exp (cid:18) − n · ζ q (cid:19)(cid:19) . (E.10)Finally using log (cid:16) ∑ qi = ( di ) (cid:17) ≤ ( q + ) log ( d · q ) and since d >
1, the claim follows.Next we bound the difference (cid:12)(cid:12) J n ( A , i ) − V ℓ ( A , i ) (cid:12)(cid:12) .40 laim E.4. If d > , r ∈ [ d ] , t > , and n ≥ q ζ ( q log d + t ) then we have that P D n ∼D n sup A ∈A ( q , ζ ) (cid:12)(cid:12) J n ( A , i ) − V ℓ ( A , i ) (cid:12)(cid:12) ≥ s + q · ( q log ( d ) + t ) n · ζ ≤ exp ( − t ) . Proof.
Since the error distribution has zero mean, i.e. E [ ε ] =
0, we have that (cid:18) E ( x , y ) ∼D [ y | x ∈ A ] (cid:19) = (cid:16) E x [ m ( x ) | x ∈ A ] (cid:17) , M ( A ) and hence (cid:12)(cid:12) J n ( A , i ) − V ℓ ( A , i ) (cid:12)(cid:12) ≤ ∑ z ∈{ } (cid:12)(cid:12)(cid:12)(cid:12) N n ( A iz ) N n ( A ) − P x (cid:16) x ∈ A iz | x ∈ A (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) (cid:18) E ( x , y ) ∼D h y | x ∈ A iz i(cid:19) ≤ ∑ z ∈{ } (cid:12)(cid:12)(cid:12)(cid:12) N n ( A iz ) N n ( A ) − P x (cid:16) x ∈ A iz | x ∈ A (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) .Now from Hoeffding bound we have that P D n ∼D n (cid:18) (cid:12)(cid:12)(cid:12)(cid:12) N n ( A iz ) k − P x (cid:16) x ∈ A iz | x ∈ A (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) ≥ t (cid:12)(cid:12)(cid:12)(cid:12) N n ( A ) = k (cid:19) ≤ exp (cid:0) − · k · t (cid:1) excluding the case N n ( A ) ≤ n · ζ · q and taking expectation of both sides we get that P D n ∼D n (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) N n ( A iz ) N n ( A ) − P x (cid:16) x ∈ A iz | x ∈ A (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) ≥ t (cid:19) ≤ E D n ∼D n (cid:20) exp (cid:0) − t · N n ( A ) (cid:1) | N n ( A ) > n · ζ · q (cid:21) ++ P D n ∼D n (cid:18) N n ( A ) ≤ n · ζ · q (cid:19) now we can use (E.8) to get that P D n ∼D n (cid:12)(cid:12)(cid:12)(cid:12) N n ( A iz ) N n ( A ) − P x (cid:16) x ∈ A iz | x ∈ A (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) ≥ s · q · tn · ζ ! ≤ exp ( − t ) + exp (cid:18) − n · ζ q (cid:19) Finally if we apply the union bound over all possible cells A ∈ A ( q , ζ ) together with the as-sumption that n ≥ q ζ ( q log d + t ) , the claim follows.If we combine Claim E.3 and E.4, the lemma follows.We are now ready to prove that the bias of every tree that is constructed by Algorithm 2 issmall under the Assumption 4.1. We start by proving the finite sample analogues of Lemma A.6.First we provide a relaxed version of the Definition A.5 and a version with finite samples. Definition E.5.
Given a partition P and a cell A of P , a positive number η and a training setD n , we define the sets L η n ( A ) = { i ∈ [ d ] | V ℓ n ( A , i ) − V ℓ n ( A ) > η } and L η ( A ) = { i ∈ [ d ] | V ℓ ( A , i ) − V ℓ ( A ) > η } . Similarly, when η = L n ( A ) and L ( A ) .41ince V n is monotone decreasing with respect to P , we have that V n ( P , A , i ) ≥ V n ( P ) .Hence given P , A the Algorithm 2 chooses the direction i that maximizes the positive quantity V n ( P , A , i ) − V n ( P ) . So the bad event is that for all j ∈ [ d ] , V n ( P , A , i ) − V n ( P ) ≥ V n ( P , A , j ) − V n ( P ) but V ( P , A , i ) − V ( P ) = k ∈ [ d ] such that V ( P , A , k ) − V ( P ) > i ∈ [ d ] that the Algorithm 2 chooses to split does not belong to L η ( P , A ) although L η ( P , A ) = ∅ . We bound the probability of this event in the next lemma. Lemma E.6.
If d > , r ∈ [ d ] , t > , and n ≥ + q ζ ( q log ( d ) + t ) and let η = q + q ( q log ( d )+ t ) ζ · n then wehave that P D n ∼D n _ A ∈A ( q , ζ ) argmax i ∈ [ d ] V ℓ n ( A , i ) !
6∈ L ( A ) | L η ( A ) = ∅ ! ≤ ( − t ) Proof.
Directly applying Lemma E.2 to Definition E.5 we have that P D n ∼D n _ i , A ∈A ( q , ζ ) (cid:0) i ∈ L η n ( A ) | i
6∈ L ( A ) (cid:1) ≤ exp ( − t ) , (E.11)where η = q + q ( q log ( d )+ t ) ζ · n and n ≥ + q ζ ( q log ( d ) + t ) . Similarly we have that P D n ∼D n _ i , A ∈A ( q , ζ ) (cid:0) i
6∈ L η n ( A ) | i ∈ L η ( A ) (cid:1) ≤ exp ( − t ) . (E.12)If we combine the above inequalities we get that there is a very small probability that thereexists an index i ∈ L η ( A ) but an index j
6∈ L ( A ) is chosen instead. This is summarized in thefollowing inequality P D n ∼D n _ A ∈A ( q , ζ ) argmax i ∈ [ d ] V ℓ n ( A , i ) !
6∈ L ( A ) | L η ( A ) = ∅ ! ≤ ( − t ) (E.13)and the lemma follows. Lemma E.7.
For every partition P , under the Assumption 4.1, if for every A ∈ P it holds that L η ( A ) = ∅ , then E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P ( x )] (cid:19) ≤ C · η · E x ∈D x [ |L ( P ( x )) | ] . Proof.
We define P to be the partition where all the cells contain only one element of the space,that is P = (cid:8) { x } | x ∈ {
0, 1 } d (cid:9) . We then know that L ( P ) =
0. Let P ′ be the refinement of P where every cell A of P has been split with respect to all the coordinates R ( A ; P ) . We first provethat L ( P ′ ) =
0. If this is not the case then there exists a cell A and a direction i
6∈ R ( A ; P ) suchthat L ( P ′ , A , i ) − L ( P ′ ) <
0. But because of the Assumption 4.1 and item (4.) of Lemma A.4 wehave that C · ( L ( P , A , i ) − L ( P )) ≤ L ( P ′ , A , i ) − L ( P ′ ) < i
6∈ R ( A ; P ) .Now assume that L η ( A ) = ∅ and for the sake of contradiction also assume that L ℓ ( A ) > C · η · |L ( A ) | . Let { r , . . . , r k } be an arbitrary enumeration of the set L ( A ) . From the argumentbefore we have that L ℓ ( A , L ( A )) = r j of L ( A ) such that L ℓ ( A , { r , . . . , r j − } ) − L ℓ ( A , { r , . . . , r j } ) > C · η ,otherwise we would immediately have L ℓ ( A ) ≤ C · η · |L ( A ) | . But because of the diminishingreturns property of L ( · ) we have that C · (cid:0) L ℓ ( A ) − L ℓ ( A , r j ) (cid:1) ≥ L ℓ ( A , { r , . . . , r j − } ) − L ℓ ( A , { r , . . . , r j } ) > C · η ,but this last inequality implies r j ∈ L η ( A ) which contradicts with our assumption that L η ( A ) = ∅ . Hence L ℓ ( A ) ≤ C · η · |L ( A ) | and if we take expectation over x , the lemma follows.Finally we need one more Lemma to handle the case where Assumption 4.2 holds. Lemma E.8.
If d > , r ∈ [ d ] , t > , assume that m is ( β , r ) -strongly partition sparse with relevantfeatures R as per Assumption 4.2 and let n ≥ + q ζ ( q log ( d ) + t ) and n ≥ + q ( q log ( d )+ t ) ζ · β then we havethat P D n ∼D n _ A ∈A ( q , ζ ) argmax i ∈ [ d ] V ℓ n ( A , i ) ! R | R \ I ( A ) = ∅ ! ≤ ( − t ) Proof.
Directly applying Lemma E.2 to Definition E.5 we have that P D n ∼D n _ i , A ∈A ( q , ζ ) (cid:0) i ∈ L η n ( A ) | i
6∈ L ( A ) (cid:1) ≤ exp ( − t ) , (E.14)where η = q + q ( q log ( d )+ t ) ζ · n and n ≥ + q ζ ( q log ( d ) + t ) . Similarly we have that P D n ∼D n _ i , A ∈A ( q , ζ ) (cid:0) i
6∈ L η n ( A ) | i ∈ L η ( A ) (cid:1) ≤ exp ( − t ) . (E.15)If we combine the above inequalities with Assumption 4.2 the lemma follows. Theorem E.9.
Let D n be i.i.d. samples from the non-parametric regression model y = m ( x ) + ε , wherem ( x ) ∈ [ − ] , ε ∼ E , E ε ∼E [ ε ] = and ε ∈ [ − ] with m an r-sparse function. Let also P n be the partition that Algorithm 2 returns. Then the following statements hold:1. Let q = C · rC · r + ( log ( n ) − log ( log ( d / δ ))) and assume that the approximate diminishing returnsAssumption 4.1 holds. Moreover if we set the number of nodes t such that log ( t ) ≥ q, and if wehave number of samples n ≥ ˜ Ω ( log ( d / δ )) then it holds that P D n ∼D n E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) > ˜ Ω C · r · C · r + r log ( d / δ ) n !! ≤ δ .43 . Suppose that the distribution D x is a product distribution (see Assumption 2.2) and that the As-sumption 4.1 holds. Moreover if log ( t ) ≥ r, then it holds that P D n ∼D n E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) > ˜ Ω r C · r · log ( d / δ )) n !! ≤ δ .
3. Suppose that the distribution D x is a product distribution (see Assumption 2.2), that is also ( ζ , r ) -lower bounded (see Assumption 4.3) and that the Assumption 4.1 holds. Moreover if log ( t ) ≥ r,then it holds that P D n ∼D n E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) > ˜ Ω C · s r · log ( d / δ )) ζ · n !! ≤ δ .
4. Suppose that m is ( β , r ) -strongly partition sparse (see Assumption 4.2) and that D x is ( ζ , r ) -lowerbounded (see Assumption 4.3). If n ≥ ˜ Ω (cid:16) r ( log ( d / δ )) ζ · β (cid:17) , and log ( t ) ≥ r, then we have P D n ∼D n E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) > ! ≤ δ . Proof.
When the value of level changes, then the algorithm considers separately every cell A in P level . For every such cell A it holds that L ℓ ( A , R ) = V ℓ ( A , R ) , V ∗ ( A ) ismaximized. Since m ( x ) ∈ [ −
1, 1 ] it holds that the maximum value of V ℓ is 1. Now let { i , . . . , i r } be an arbitrary enumeration of R and let R j = { i , . . . , i j } . Then by adding and subtracting termsof the form V ℓ ( A , R j ) we have the following equality (cid:0) V ℓ ( A , R ) − V ℓ ( A , R r − ) (cid:1) + · · · + (cid:0) V ℓ ( A , R ) − V ℓ ( A , i ) (cid:1) + V ℓ ( A , i ) = V ∗ ( A ) .From Assumption 4.1 we have that (cid:0) V ℓ ( A , i r ) − V ℓ ( A ) (cid:1) + · · · + (cid:0) V ℓ ( A , i ) − V ℓ ( A ) (cid:1) + (cid:0) V ℓ ( A , i ) − V ℓ ( A ) (cid:1) ≥ V ∗ ( A ) − V ℓ ( A ) C which implies max j ∈ [ r ] (cid:0) V ℓ ( A , i j ) − V ℓ ( A ) (cid:1) ≥ V ∗ ( A ) − V ℓ ( A ) C · r .Let i level A be the coordinate that the algorithm chose to split at cell A at level level. Now fromthe greedy criterion that Algorithm 2 uses to pick the next coordinate to split on, we have that i level A was at least as good as the best of the coordinates in R with respect to V ℓ n . If we set ζ = ˜ Θ (cid:18) C · r + q log ( d / δ ) n (cid:19) , q = C · rC · r + ( log ( n ) − log ( log ( d / δ ))) and ξ = ˜ Θ (cid:18) C · r + q log ( d / δ ) ζ · n (cid:19) and useLemma E.2 we get that if n ≥ ˜ Ω ( log ( d / δ )) and if P x ∼D x ( x ∈ A ) ≥ ζ /2 q then it holds withprobability at least 1 − δ that V ℓ (cid:16) A , i level A (cid:17) ≥ V ℓ ( A ) + V ∗ ( A ) − V ℓ ( A ) C · r − ξ This condition is necessary to guarantee that we have enough number of samples n to make the splits that arenecessary for the q splits of every path of the tree. L ∗ ( A ) , L ℓ ( A , R ) = L ℓ (cid:16) A , i level A (cid:17) ≤ L ℓ ( A ) (cid:18) − C · r (cid:19) + ξ . (E.16)We fix Q level to be the partition P level of {
0, 1 } d when level changed. Also we define U to be theset of cells A in Q level such that P x ∼D x ( x ∈ A ) ≥ ζ /2 q and V the rest of the cells A in Q level .Then because of E.16 and Lemma A.4 it holds that L ( Q level + ) = ∑ A ∈Q level P x ∼D x ( x ∈ A ) L ℓ (cid:16) A , i level A (cid:17) + ξ = ∑ A ∈U P x ∼D x ( x ∈ A ) L ℓ ( A ) (cid:18) − C · r (cid:19) + ∑ A ∈V P x ∼D x ( x ∈ A ) L ℓ (cid:16) A , i level A (cid:17) + ξ ≤ L ( Q level ) (cid:18) − C · r (cid:19) + ζ q ∑ A ∈V L ℓ (cid:16) A , i level A (cid:17) + ξ ≤ L ( Q level ) (cid:18) − C · r (cid:19) + ξ + ζ .Inductively and using the fact that m ( x ) ∈ [ −
1, 1 ] , we get that L ( Q level ) ≤ L ( P ) (cid:18) − C · r (cid:19) level + ( ξ + ζ ) · level ≤ (cid:18) − C · r (cid:19) level + ( ξ + ζ ) · level. (E.17)Finally if we set η = ξ + ζ = Θ ( ξ ) from the choice of t we have that level ≥ C · r ln ( η ) andhence when level is exactly equal to C · r ln ( η ) , it holds that L ( Q level ) ≤ · C · r ln ( η ) η .Now from the monotonicity of the L function with respect to Q level we have that for any level ≥ C r ln ( η ) it holds that L ( Q level ) ≤ · C · r ln ( η ) η and the first part of the theorem follows.For the second part of the theorem we use Lemma E.6 and we have that at every step eitherthe algorithm, if A ∈ A ( r , ζ ) then we chose to split with respect to a direction i ∈ L ( A ) or L η ( A ) = ∅ for η = q + r ( r log ( d )+ t ) ζ · n . Because of our assumption that m is r -sparse and becausewe assume that the features are distributed independently we have that at any step |L ( A ) | ≤ r .Hence at r levels it has to be that for every cell A ∈ A ( r , ζ ) ∩ P n , it holds that L η ( A ) = ∅ . Letnow U be the cells A ∈ P n such that A ∈ A ( r , ζ ) , and let V be the rest of the cells in P n . Thenusing Lemma E.7 we have that L ( P n ) = ∑ A ∈U P x ∼D x ( x ∈ A ) · L ℓ ( A ) + ∑ A ∈V P x ∼D x ( x ∈ A ) · L ℓ ( A ) ≤ C · η · r + ζ now setting ζ = ( C · r ) q + r ( r log ( d )+ t ) n we get L ( P n ) ≤ ζ and since L ( P ) is a monotone function, the second part of the theorem follows.For the third part of the theorem we use Lemma E.6 and we have that at every step ei-ther the algorithm chose to split with respect to a direction i ∈ L ( A ) or L η ( A ) = ∅ for η = q + r ( r log ( d )+ t ) ζ · n . Because of our assumption that m is r -sparse and because we assume45hat the features are distributed independently we have that at any step |L ( A ) | ≤ r . Hence at r levels it has to be that for every cell A , it holds that L η ( A ) = ∅ . Then using Lemma E.7 wehave that L ( P n ) ≤ C · η · r and since L ( P ) is a monotone function, the third part of the theoremfollows.The last part of the theorem follows easily from Lemma E.8.Recall the definition of the value-diameter from Definition C.1. We can prove the following. Theorem E.10.
Let D n be i.i.d. samples from the non-parametric regression model y = m ( x ) + ε , wherem ( x ) ∈ [ − ] , ε ∼ E , E ε ∼E [ ε ] = and ε ∈ [ − ] . If m is ( β , r ) -strongly partition sparse(see Assumption 4.2) and D x is ( ζ , r ) -lower bounded (see Assumption 4.3) then the following statementshold for the bias of the output of Algorithm 1..1. If n ≥ ˜ Ω (cid:16) r ( log ( d / δ )) ζ · β (cid:17) , and log ( t ) ≥ r, then it holds that P D n ∼D n ( ∆ m ( P n ) = ) ≥ − δ .
2. Let R be the set of relevant features, x ∈ {
0, 1 } d and assume that we run Algorithm 2 with inputh = and log ( t ) ≥ r. If n ≥ ˜ Ω (cid:16) r ( log ( d / δ )) ζ · β (cid:17) then it holds that (cid:18) E D n ∼D n [ m n ( x ; P n )] − m ( x ) (cid:19) ≤ δ . Proof.
For the first part of the theorem we fix any possible cell A after the first r iterations of theAlgorithm 2. For simplicity of the exposition of this proof we define for every subset B of {
0, 1 } n the probability P B , P x ∼D x [ x ∈ B ] and the empirical probability ˆ P B , n ∑ ni = { x ( j ) ∈ B } . Usingthe multiplicative form of the Chernoff bound we get that P D n ∼D n n ˆ P A ≥ − s ( δ ) nP A nP A ≥ − δ .Hence for n ≥ ( δ ) P A we have that P D n ∼D n n ∑ i = { x ( i ) ∈ A } ≥ ! ≥ − δ .Next we can apply a union bound over all possible cell A that split according to the R coordinatesand using our assumption that P A ≥ ζ r we get that for n ≥ r ( r + log ( δ ) ζ it holds that P D n ∼D n _ A n ∑ i = { x ( i ) ∈ A } ≥ !! ≥ − δ . (E.18)Now let Q r be the set of splits after r iterations of the Algorithm 2. Then the Lemma E.8 impliesthat L ( Q r ) =
0. Finally from (E.18) we also have that the partition Q r is the partition the fullpartition to all the cells of R and hence E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ P n ( x )] (cid:19) ≤ E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z R = x R ] (cid:19) − δ it holds that ∑ A ∈P n P x ∼D x ( x ∈ A ) · E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ A ] (cid:19) | x ∈ A = D x is ( ζ , r ) -lower bounded it has to be that for every cell A ∈ P n it holds that E x ∼D x "(cid:18) m ( x ) − E z ∼D x [ m ( z ) | z ∈ A ] (cid:19) | x ∈ A = ∆ m ( A ) = ∆ m ( P n ) = − δ .For the second part of the theorem we define for simplicity w ( j ) ( x ) = { x ∈P n ( x ( j ) ) } N n ( P n ( x ( j ) ) and hence m n ( x ) = ∑ ni = w ( j ) ( x ) y ( j ) and we have: (cid:18) E D n ∼D n [ m n ( x )] − m ( x ) (cid:19) == E D n ∼D n " n ∑ j = w ( j ) ( x ) ( y ( j ) − m ( x ( j ) )) + E D n ∼D n " n ∑ j = w ( j ) ( x ) ( m ( x ( j ) ) − m ( x )) Due to honesty, which is implied by h = w ( j ) ( x ) is independent of y ( j ) and we have that the first term is equal to 0 by a tower law. Thus we have: (cid:18) E D n ∼D n [ m n ( x )] − m ( x ) (cid:19) = E D n ∼D n " s ∑ j = w ( j ) ( x ) ( m ( x ( j ) ) − m ( x )) ≤ E D n ∼D n n ∑ j = w ( j ) ( x )( m ( x ( j ) ) − m ( x )) ! Let also A = { z | z R = x R } , then using the multiplicative form of the Chernoff Bound from theproof of the first part of the theorem we get P D n ∼D n (cid:16) ∑ ni = { x ( i ) ∈ A } ≥ (cid:17) ≥ − δ . Thereforewith probability 1 − δ the path of the tree that leads to x has split all the relevant coordinates R and hence for all j such that w ( j ) ( x ) > x ( j ) R = x R which in turn implies that m ( x ( j ) ) = m ( x ) . With the rest δ probability the square inside the expectation is at most 1 since m ( · ) ∈ (cid:2) − , (cid:3) , hence we get (cid:18) E D n ∼D n [ m n ( x )] − m ( x ) (cid:19) ≤ δ . E.2 Proof of Theorem 4.4
Observe that the output estimate m n ( · ; P n ) and partition P n of Algorithm 2, satisfies the condi-tions of Lemma B.2. Moreover, since the number of vertices in a binary tree upper bounds the47umber of leafs we can apply Corollary B.3 and we have that the critical radius quantity δ n is oforder Θ (cid:18)q t log ( d t )( + log ( n )) n (cid:19) , if the total number of nodes is at most t . Thus applying the boundpresented in (B.4) with the bound on δ n we have the following cases each of the cases of Theorem4.4 by using the corresponding case of Theorem E.9. E.3 Proof of Theorem 4.5
From case 1. of Theorem E.10 and since the maximum possible value diameter is 1, we havethat if s ≥ ˜ Ω (cid:16) r ( log ( d / δ )) ζ · β (cid:17) then E D n ∼D n [ ∆ m ( P n )] ≤ δ which implies E x ∼D x [( m s ( x ) − m ( x )) ] ≤ δ .Putting this together with Lemma C.2 we get that if s = ˜ Ω (cid:16) r ( log ( d / δ )) ζ · β (cid:17) then P D n ∼D n (cid:18) E x ∼D x [( m n , s ( x ) − m ( x )) ] ≥ ˜ Ω (cid:18) r ( log ( d / ( δ · δ ′ ))) n · ζ · β (cid:19) + δ (cid:19) ≤ δ ′ .From the above we get Theorem 4.5 by setting δ = ˜ Ω (cid:16) r ( log ( d / δ ′ )) n · ζ · β (cid:17) . F Proofs of Asymptotic Normality
F.1 Proof of Theorem 3.5
We define m s , π to be the output of the Algorithm 1 when the samples have been permuted bythe permutation π ∈ S s . We denote by S n the uniform distribution over the symmetric group S n .We also define m s ( x ) = E D n ∼D n , τ ∼S [ m n , s , τ ( x )] = E D s ∼D s , π ∼S s [ m s , π ( x )] . (F.1)where the last inequality follows due to symmetry of the distribution D n and the definition of m n , s , τ . We also remind that m n , s is equal to E τ ∼ S n [ m n , s , τ ] and that m n , s , B is the Monte Carloapproximation of m n , s with B terms. We now have the following. σ − n ( x )( m n , s , τ ( x ) − m ( x )) = σ − n ( x )( m n , s , B ( x ) − m n , s ( x )) + σ − n ( x )( m n , s ( x ) − m s ( x ))+ σ − n ( x )( m s ( x ) − m ( x )) . (F.2)We define for simplicity w ( j ) ( x ) = { x ( j ) T n ( S , x ) = x T n ( S , x ) } N n ( x ; T n ( S , x )) . By Theorem 2 of [FLW18] we have that: σ − n ( x )( m n , s ( x ) − m s ( x )) → N (
0, 1 ) (F.3)48here: σ n ( x ) = s n Var x ( ) ∼D x E D s ∼D s , τ ∼S n " s ∑ j = w ( j ) ( x ) y ( j ) | x ( ) , y ( ) ≥ s n E x ( ) ∼D x (cid:20) E D s ∼D s , τ ∼S n h w ( ) ( x ) | x ( ) i σ ( x ( ) ) (cid:21) ≥ s σ n E x ( ) (cid:20) E D s ∼D s , τ ∼S n h w ( ) ( x ) | x ( ) i (cid:21) ≥ s σ n E D s ∼D s , τ ∼S n h w ( ) ( x ) i = s σ n s = σ n Where the last inequality follows by our assumption that σ ( x ) ≥ σ , uniformly for all x and thefact that due to the expectation over the random permutation τ in the beginning of the algorithmwe have symmetry between the samples and hence: E D n ∼D n , τ ∼S n [ w ( ) ( x )] = ( s ) . Also, since s ≥ ˜ Ω (cid:16) r ( log ( d / δ )) β + r log ( n ) ζ (cid:17) , from part 3 of Theorem D.10 with δ = n we have that: σ − n ( x ) ( m s ( x ) − m ( x )) = o p ( ) (F.4)where we have used the fact that the part 3 of Theorem D.10 holds pointwise for every per-mutation τ ∈ S n . The last step is to bound the error from the Monte Carlo approximation m n , s , B of m n , s . Since we have fixed the x ∈ {
0, 1 } before the execution of the algorithm andsince m n , s , τ ( x ) ∈ [ − ] , we can use the Hoeffding bound with B = n log ( n ) and get thefollowing P (cid:18) | m n , s , B ( x ) − m n , s ( x ) | ≥ n (cid:19) ≤ n where the probability is over the randomness that is used to sample the B permutations uni-formly from S n to compute the empirical expectation m n , s , B . Hence we have that σ − n ( x ) ( m n , s , B ( x ) − m n , s ( x )) = o p ( ) . (F.5)Finally, putting together (F.2), (F.3), (F.4), (F.5) and invoking Slutzky’s theorem we get that: σ − n ( x )( m n , s , B ( x ) − m ( x )) → d N (
0, 1 ) . F.2 Proof of Theorem 4.6
The proof is almost identical to the proof of Theorem 3.5 presented in the previous section. Theonly difference is the derivation of (F.4). For this instead of using part 3 of Theorem D.10 weuse part 2 of Theorem E.10 again with δ = n . The rest of the proof remains the same andTheorem 4.6 follows. 49 Necessity of Submodularity
Let m : {
0, 1 } d → [ −
1, 1 ] be a 2-sparse function such that m ( x ) = x + x − x x where andassume that the feature vector x is sampled uniformly at random from {
0, 1 } d and there is nonoise, i.e. ε i = i ∈ [ n ] . Then it is easy to see the approximate submodularity does notholds for any constant C . This is due to the fact that V ( { } ) − V ( ∅ ) = V ( {
1, 2 } ) − V ( { } ) >
0. So our theorem do not apply in this case. Nevertheless, we next arguethat this is not a limitation of our analysis but a limitation of the greedy algorithms that we areanalyzing in this paper.More precisely, Then it is easy to see that even with infinite number of samples, in the level-split model for any S ⊆ [ d ] such that S ∩ {
1, 2 } = ∅ it holds that V ( S ∪ { } ) = V ( S ∪ { } ) = V ( S ∪ { j } ) = V ( S ) .This implies that until the greedy algorithm picks the coordinates 1 or 2 these relevant coordi-nates have the same mean square error reduction as any other coordinate. Hence the greedyalgorithm picks at every step a coordinate at random with probability 1/ ( d − | S | ) . This meansthat we need depth at least Ω ( d ) to get small mean square error for this function m . This impliesthat we need at least 2 Ω ( d ))