Asymptotic Normality for Multivariate Random Forest Estimators
Kevin Li∗
Massachusetts Institute of Technology
[email protected]
December 16, 2020
Abstract
Regression trees and random forests are non-parametric estimators that are widely used in data analysis. A recent paper by Athey and Wager shows that the pointwise random forest estimate is asymptotically Gaussian. In this paper, we extend their result to the multivariate case and show that a vector of estimates, taken at multiple points, is jointly asymptotically normal. Specifically, the covariance matrix of the limiting normal distribution is diagonal, so that the estimates at any two points are independent in sufficiently deep trees. We show that the off-diagonal terms are bounded by quantities related to the probability of two given points belonging to the same leaf of the resulting tree. Our results rely on certain stability properties of the underlying tree estimator, and we give examples of splitting rules for which this holds. We also provide a heuristic and numerical simulations for gauging the decay of the off-diagonal term in finite samples.
Trees and random forests are non-parametric estimators first introduced by Breiman [1]. Given a feature space X ⊂ R^p and a set of data points {(X_i, Y_i)} ⊆ X × R, a tree estimator recursively partitions the feature space into axis-aligned, non-overlapping hyperrectangles by repeatedly splitting X along a given axis.¹ The prediction of the tree estimator at a test point x ∈ X is then an aggregate of the targets Y_i that land in the hyperrectangle containing x; when Y_i is continuous, the aggregate is the sample mean and the tree is also known as a regression tree. The depth of a tree estimator—defined as the maximal number of splits taken before reaching a terminal hyperrectangle—controls the complexity of the tree estimator. There are two popular methods for controlling complexity: the "boosting" approach grows trees of large depth, then reduces complexity by either trimming the tree (i.e., so that predictions are made at a non-terminal hyperrectangle) or introducing a decay factor; the "bagging" approach instead grows a collection of shallow trees on different subsets of the data, and averages over the trees for the final prediction. The intuition for bagging is that trees grown on different subsets are not perfectly correlated, so that aggregation reduces variance and balances the bias-variance tradeoff. Estimators of this type are called random forests, and they are the focus of this paper.

∗ I would like to thank my advisors Alberto Abadie and Victor Chernozhukov for reviewing multiple drafts of this paper. In addition, Sophie (Liyang) Sun, Ben Deaner, participants in the seminar class 14.386 (Spring 2020), and participants of MIT's Econometrics Lunch seminar provided very useful feedback. I would also like to thank Professor Stefan Wager for helping me understand the mechanics of covariance estimates in random forest models.

¹ When the feature space X is not rectangular, one may always enlarge X to a rectangular set X′ defined as the intersection of all rectangular sets containing X.

Since their introduction in the early 2000s, random forests have become an increasingly important tool in applied data analysis, owing to a number of practical advantages over competing models. First, high-quality random forest libraries are readily available, with popular implementations that scale to hundreds of distributed workers [2, 3]. Moreover, the core algorithms behind tree estimators and random forests are simple enough to allow for rapid prototyping of bespoke implementations, e.g. [4]. Another advantage of tree-based methods is that they can ingest real-world data without much issue: continuous, discrete, and ordered categorical features may be freely mixed [5],² model estimates are immune to feature outliers, and missing data may be easily incorporated. Additionally, their construction naturally aligns with the spatial locality of most applied problems: that is, the underlying target function relating Y to X is continuous. Finally, tree models are interpretable, with well-defined notions of feature importance [6, 7], which supports their use as model selection tools [8].

Within economics, random forests may be fruitfully applied to estimate heterogeneous treatment effects. In Rubin's potential outcomes framework [9] (see [10] for an overview), an individual i is associated with two potential outcomes Y^(0) and Y^(1), with one of the outcomes being realized depending on whether i undergoes treatment.
The statistician has access to the IID observations {(X_i, W_i, Y_i) : 1 ≤ i ≤ n}, where X_i is a vector of observed covariates for individual i, W_i ∈ {0, 1} is (an encoding of) their treatment status, and Y_i = Y_i^(W_i) is her realized outcome. One quantity of interest is the treatment effect at x,

    τ(x) := E(Y^(1) − Y^(0) | X_i = x).    (1)

Since only one of Y_i^(0) and Y_i^(1) is observed, consistent estimation of τ(x) requires further distributional assumptions. A common assumption is unconfoundedness, i.e., that treatment status W_i is independent of Y^(1) and Y^(0) conditional on X_i. Under this assumption,

    τ(x) = E[ Y_i ( W_i / e(x) − (1 − W_i) / (1 − e(x)) ) | X_i = x ],  where e(x) = P(W_i = 1 | X_i = x).    (2)

Here, the key function e(x), known as the propensity score, is the probability of treatment for the subpopulation with covariates x; see [11] for the derivation and implications. Machine learning methods—including random forests—may be brought to bear on the problem by estimating e(x). Alternatively, unconfoundedness also implies

    τ(x) = E(Y | W = 1, X = x) − E(Y | W = 0, X = x),    (3)

so that τ(x) may be estimated by fitting two models, one on the subset of the sample in which W = 1, and the other on W = 0.

² Splits on discrete features partition that variable into two arbitrary non-empty sets; no changes are needed for ordered categorical features.
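As a rough illustration of the two-model strategy in (3), the sketch below fits one regression forest per treatment arm and differences their predictions; the difference of treatment effects at two points is then a difference of such estimates. The simulated data, the scikit-learn RandomForestRegressor, and all tuning parameters here are illustrative assumptions rather than the exact procedure analyzed in this paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated data (illustrative): X uniform on [0,1]^2, random treatment assignment,
# and a true treatment effect tau(x) = x_1 that we would like to recover.
n = 5000
X = rng.uniform(size=(n, 2))
W = rng.integers(0, 2, size=n)
Y = X[:, 0] + X[:, 1] + W * X[:, 0] + rng.normal(scale=0.1, size=n)

# One forest per treatment arm, as in equation (3).
rf1 = RandomForestRegressor(n_estimators=200, min_samples_leaf=5).fit(X[W == 1], Y[W == 1])
rf0 = RandomForestRegressor(n_estimators=200, min_samples_leaf=5).fit(X[W == 0], Y[W == 0])

def tau_hat(x):
    """Estimated treatment effect at a single test point x."""
    x = np.asarray(x, float).reshape(1, -1)
    return rf1.predict(x)[0] - rf0.predict(x)[0]

# Difference of treatment effects at two subpopulations; the variance of this
# difference involves the covariance of the forest estimates at x and x_bar.
x, x_bar = [0.2, 0.5], [0.8, 0.5]
print(tau_hat(x) - tau_hat(x_bar))
```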
In econometric applications, conducting pointwise inference on the target function f : X → R (e.g., to test the null hypothesis H_0 : f(x) = 0) requires knowledge about the rate of convergence or asymptotic distribution of the underlying estimator f̂(x), where x is the point of interest. However, functionals of the target function are often also of interest: for example, the difference of treatment effects (i.e., f = τ) for two different subpopulations is captured by the quantity

    f(x) − f(x̄),    (4)

where x and x̄ are covariates describing the two subpopulations. More generally, we might also be interested in a weighted treatment effect, where a subpopulation x is given an importance weight modeled as a density µ(x). In this case, the corresponding functional of f is

    ∫_{x∈X} f(x) dµ,  where µ is not necessarily the density of X,    (5)

and the integral is taken over the domain X.

Inference on functionals of f requires not only the asymptotic distribution of the point estimate f(x), but also the correlation between estimates at different points f(x) and f(x̄). As a concrete example, consider the function τ(x) and the simple difference τ(x) − τ(x̄). We have

    τ(x) − τ(x̄) = [E(Y | W = 1, X = x) − E(Y | W = 1, X = x̄)] − [E(Y | W = 0, X = x) − E(Y | W = 0, X = x̄)] =: A − B.    (6)

We may estimate the difference by estimating A and B separately, fitting a random forest model to the two "halves" of the dataset where W_i = 1 and W_i = 0, as discussed above. The estimators Â and B̂ obtained are thus independent, so that Var(Â − B̂) = Var Â + Var B̂. The variances of Â and B̂ in turn depend on the covariance of their respective random forest estimates at x and x̄.

This paper studies the correlation structure of a class of random forest models whose asymptotic distributions were first worked out in [12]. We find sufficient conditions under which the asymptotic covariance of random forest estimates at different points vanishes relative to their respective variances; moreover, we provide finite sample heuristics based on our calculations. To the best of our knowledge, this is the first set of results on the correlation structure of random forest estimators.

The present paper builds on and extends the results in [12], which in turn builds on related work [14] on general concentration properties of trees and random forest estimators. See also [13], which extends the random forest model considered here to a broader class of target functions by incorporating knowledge of moment conditions. Stability results established in this paper have appeared in [15], who study notions of algorithmic stability for random forests and logistic regression and derive generalization error guarantees. Also closely related to our paper are [16] and [17], concerning finite sample Gaussian approximations of sums and U-statistics in high dimensions. In this context, our paper provides a stepping stone towards applying the theory of finite sample U-statistics to random forests, where bounds on covariance matrices play a central role.

The paper is structured as follows. In Section 2, we introduce the random forest model and state the assumptions required for our results; Section 3 contains our main theoretical contributions; Section 4 builds on Section 3 and discusses heuristics useful in finite sample settings; Section 5 concludes. All proofs are found in the appendix.
The goal of this paper is to study asymptotic Gaussian approximations of random forest estimators. Throughout, we assume that a random sample {Z_i = (X_i, Y_i) : 1 ≤ i ≤ n} ⊆ X × R is given, where each X_i is a vector of features or covariates belonging to a subset X ⊆ R^p of p-dimensional Euclidean space, and Y_i ∈ R is the response or target corresponding to X_i. We will refer to X as the feature space or the feature domain.

Given the data set {Z_i}_{i=1}^n, a tree estimator recursively partitions the feature space by making axis-aligned splits. Specifically, an axis-aligned split is a pair (j, t) where j ∈ {1, . . . , p} is the splitting coordinate and t ∈ R is the splitting index; given a subset R ⊆ X, a split (j, t) divides R into left and right halves

    {x ∈ R : x_j < t}  and  {x ∈ R : x_j > t},    (7)

where x_j denotes the j-th coordinate of the vector x.³ Starting with the entire feature space X, the recursive splitting algorithm computes an (axis-aligned) split based on the data {Z_i : 1 ≤ i ≤ n}; for example, when the target Y_i is continuous, a popular choice is

    (j, t) = arg min_{j̃, t̃} Σ_{i : X_i ∈ L} (Y_i − µ_L)² + Σ_{i : X_i ∈ R} (Y_i − µ_R)²,    (8)

where L = L(j̃, t̃) and R = R(j̃, t̃) are the two halves of X obtained by the split (j̃, t̃), with µ_L and µ_R being the averages of the targets Y_i whose corresponding features land in L and R, respectively.

After the first split, X is divided into two halves L and R. The process is then repeated for L and R separately, in that a split for L is computed using the subset of the data whose features X_i belong to L, and likewise for R. Each of the halves is then split again, and so on, until a stopping criterion is met. The process completes when the stopping criterion is satisfied for each node; at this point, the collection of halves forms a partition of X, with each element of the partition—and all the halves that came before it—being the intersection of a hyperrectangle with X. The sequence of splits corresponds to a tree in the natural way; we will call the halfspaces that arise during the splitting nodes, and elements of the final partition terminal nodes.

³ According to (7), we exclude edge cases where points lie on an "edge" of the rectangle. This is not an issue for continuous variables, while for categorical variables, the definition should be slightly changed so that one of the halves contains x_j = t. In this paper, we deal only with continuous features.

Given the collection N_1, . . . , N_q of terminal nodes (which form a partition of X), the prediction of the tree at a generic test point x ∈ X is the average of the responses that belong in the same terminal node as x:

    T(x; ξ, Z_1, . . . , Z_n) = Σ_{j=1}^q 1(x ∈ N_j) (1/|N_j|) Σ_{i : X_i ∈ N_j} Y_i,    (9)

where the inner sum runs over observations i for which X_i belongs to the terminal node N_j, and |N_j| is the number of such observations. The input ξ is an external source of randomization to allow for randomized split selection procedures. Thus, T(x; ξ, Z_1, . . . , Z_n) refers to the prediction at x for a tree grown using data {Z_1, . . . , Z_n} with randomization parameter ξ. As a function of x, keeping Z_1, . . . , Z_n and ξ fixed, T : X → R is then a step function, i.e., a linear combination of indicator functions of rectangular sets.
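For concreteness, the following sketch carries out the exhaustive search in (8) over axis-aligned splits of a single node and forms the child-mean prediction in the spirit of (9) after one split. It is a minimal, unoptimized illustration; the thresholds, tie-breaking, and stopping rules of an actual implementation are not modeled.

```python
import numpy as np

def best_split(X, Y):
    """Exhaustive search for the SSE-minimizing axis-aligned split (j, t) of one
    node, as in equation (8).  X has shape (m, p) and Y has shape (m,)."""
    m, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:   # midpoints as candidate thresholds
            left, right = Y[X[:, j] < t], Y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best[0], best[1]

def child_mean_prediction(X, Y, x, j, t):
    """Prediction in the spirit of (9) after a single split: the mean response
    over the training points falling in the child that contains x."""
    mask = X[:, j] < t if x[j] < t else X[:, j] > t
    return Y[mask].mean()

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = X[:, 0] + rng.normal(scale=0.1, size=200)
j, t = best_split(X, Y)
print(j, t, child_mean_prediction(X, Y, np.array([0.3, 0.5]), j, t))
```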
We note here that equations (8) and (9) are not the only possible choices; in particular, the rule used to choose the optimal split may be path dependent (i.e., dependent on previous splits) as in popular implementations (see [3], which allows for random forests but uses gradient boosting after the initial split), and the final prediction rule (9) may instead do a final linear fit or use a weighted average (c.f., [2, 3]). Analysis of tree (and random forest) models is complicated in these cases, so the present paper stipulates that the algorithm uses a splitting rule that is "similar" to (8) (c.f. Proposition 4) and uses (9) as the final prediction. In particular, this implies that the target estimated by the tree estimator is the regression function x ↦ E(Y | X = x).

Given a specific tree estimator T, i.e., given a set of splitting rules which defines the tree estimator, we define the random forest estimator to be the average of the tree estimator across all C(n, s) subsamples, marginalizing over the randomization device ξ. Specifically, the random forest estimate RF(x) at x ∈ X, given data {Z_1, . . . , Z_n}, is defined to be

    RF(x; Z_1, . . . , Z_n) = (1/C(n, s)) Σ_{i_1, . . . , i_s} E_ξ T(x; ξ, Z_{i_1}, . . . , Z_{i_s}),    (10)

where the summation runs over size-s subsets of {1, . . . , n}, and the inner expectation is taken with respect to ξ. Importantly, note that each tree is grown on a subsample of size s < n. We follow [12] in assuming that s ∼ n^β for some β sufficiently close to one; specifically, we assume throughout that the subsample size is chosen so as to satisfy the assumptions of Theorem 3 of [12], so that—along with other assumptions to be introduced presently—the random forest estimator RF is a consistent estimator of the target function x ↦ E(Y | X = x) (see discussion above).

In keeping with the notation of [12], we will write T(x; Z_1, . . . , Z_s) to refer to the expectation of T(x; ξ, Z_1, . . . , Z_s) over ξ. With this notation, the random forest estimator (10) at x is a U-statistic with the size-s kernel (Z_1, . . . , Z_s) ↦ T(x; Z_1, . . . , Z_s). We will discuss the U-statistic representation of RF in greater detail in Section 3.
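The average over all C(n, s) subsamples in (10) is intractable to compute exactly, so in practice it is approximated by Monte Carlo over random subsamples. The sketch below does this with scikit-learn's DecisionTreeRegressor standing in for the tree estimator T; the library's splitting rule differs from the one assumed here, so the code should be read as an approximation of the estimator's structure, not of its exact distribution theory.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest_predict(X, Y, x_test, s, B=500, seed=0):
    """Monte Carlo approximation of the subsampled forest (10): average the
    predictions of B trees, each grown on a random size-s subsample drawn
    without replacement."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    preds = np.zeros((B, len(x_test)))
    for b in range(B):
        idx = rng.choice(n, size=s, replace=False)        # one size-s subsample
        tree = DecisionTreeRegressor(min_samples_leaf=2)  # small leaves, cf. (alpha, k)-regularity
        preds[b] = tree.fit(X[idx], Y[idx]).predict(x_test)
    return preds.mean(axis=0)

# Example with s ~ n^beta for beta close to one, as assumed in the text.
rng = np.random.default_rng(1)
n, beta = 2000, 0.8
X = rng.uniform(size=(n, 2))
Y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n)
print(random_forest_predict(X, Y, np.array([[0.3, 0.7]]), s=int(n ** beta)))
```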
2.3 Discussion of Model Assumptions

As our results will be an extension of the results in [12], we will study the same model of random forests and adopt a similar set of assumptions. The assumptions regarding tree estimators have appeared before in [14], while the distributional assumptions on the conditional moments of Y are standard (see e.g., Chapters 7 and 9 in [18]).

The first—and most bespoke—assumption is that the tree algorithm is honest. Intuitively, honesty stipulates that knowledge of the tree structure does not affect the conditional distribution of tree estimates when the features are fixed.

Assumption 1 (Honesty). The target Y_i and the tree structure (i.e., the splitting coordinates and splitting indices) are independent conditional on X_i. Specifically, we require

    dist(Y_i | X_i, S) = dist(Y_i | X_i),    (11)

for all observations i where Y_i participates in the final prediction, where S is the set of splits chosen by the tree algorithm. (The second equality is automatic as a consequence of independent observations.)

There are several ways to satisfy this assumption. The first is to calculate splits based on the features X_i only. This rules out the example splitting rule given in (8), so we could instead use its analog in feature space,

    (j, t) = arg min_{j̃, t̃} Σ_{i : X_i ∈ L} ‖X_i − µ_L‖² + Σ_{i : X_i ∈ R} ‖X_i − µ_R‖²,    (12)

where here µ_L and µ_R denote the average (i.e., center of mass) of the X_i in each halfspace. In this instance, the choice of splits is essentially a clustering algorithm that finds the best division of the sample points into two parts. Another way to satisfy the honesty assumption while still computing splits based on the targets is to use sample splitting. The data is partitioned into two parts I_1 and I_2; observations in I_1 and the features X_i for i ∈ I_2 may be freely used during the splitting process, while the Y_i for i ∈ I_2 are used to determine terminal node values. In this case, equality in (11) is required to hold for i ∈ I_2. Finally, a third method to satisfy honesty requires the existence of auxiliary data {W_i}. During the splitting stage ("model fitting"), splits are computed as if the response variable is W_i; for example, (8) is used with µ_L and µ_R being the averages of the {W_i}. Once the tree is fully grown, predictions ("model inference"⁴) are made using the Y_i's as usual. The practice of using such surrogate targets is especially popular in time-series prediction, where different horizons are used in the fitting and inference steps (c.f., [21]).

In the present paper, for simplicity of notation, we shall assume that the first scheme is used to satisfy honesty—namely, splitting decisions are based on the feature vectors X_i only. Our results continue to hold under the other schemes as well.⁵

⁴ The terminology 'inference' is used to mean computing the predictions of an existing model, which is unrelated to the typical usage of 'inference' in econometrics. The former terminology is standard in applied settings, used when describing a data pipeline: see the documentation of [19, 20].

⁵ As in [12], constants appearing in our bounds may change in scheme two.
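A minimal sketch of the sample-splitting scheme for honesty: the tree structure is learned on one half of the data (I1) and the terminal-node averages are recomputed on the other half (I2). It assumes scikit-learn's DecisionTreeRegressor and its apply() method (which returns the leaf index of each observation); the fallback for empty leaves is an implementation convenience, not part of the formal scheme.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def honest_tree_predict(X, Y, x_test, seed=0):
    """Honest prediction via sample splitting: the splits are chosen using only
    the I1 half of the data, and the leaf values are recomputed as averages of
    Y over the I2 half within each leaf."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    perm = rng.permutation(n)
    I1, I2 = perm[: n // 2], perm[n // 2:]

    tree = DecisionTreeRegressor(min_samples_leaf=5)
    tree.fit(X[I1], Y[I1])                     # structure determined by I1 only

    leaves_I2 = tree.apply(X[I2])              # leaf index of each I2 observation
    leaves_test = tree.apply(np.asarray(x_test, float))
    preds = np.empty(len(leaves_test))
    for i, leaf in enumerate(leaves_test):
        mask = leaves_I2 == leaf
        # Fall back to the overall I2 mean if no I2 point lands in this leaf.
        preds[i] = Y[I2][mask].mean() if mask.any() else Y[I2].mean()
    return preds
```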
Our next assumption ensures that each one of the p axes is chosen as the splitting coordinate with a probability bounded from below.

Assumption 2 (Randomized Cyclic Splits). When computing the optimal split, the algorithm flips a probability-δ coin that is entirely independent of everything else. The first time the coin lands heads, the first coordinate is chosen as the splitting coordinate; the second time, the second coordinate is chosen, and so on, such that on the J-th time the coin lands heads, the ((J − 1) mod p) + 1-th coordinate is chosen.⁶ After the random splitting coordinate is chosen, the splitting index may still be chosen based on the observations.

⁶ Coordinates are indexed 1 through p, hence the addition of 1 to (J − 1) mod p.

This is a modification of the random splitting assumption in [12], in which each of the p axes has a probability δ of being chosen at each split. This could be directly implemented by flipping a probability-pδ coin and selecting one of the p coordinates uniformly at random to split on when the coin lands heads. Another method, studied in [13] and implemented by popular libraries such as [3], randomizes the number of available splitting axes in each round. Specifically, a Poisson random variable Q with intensity proportional to √p is first realized (Q is realized independently from round to round).⁷ Afterwards, min(Q, p) many axes are uniformly selected as potential candidates for splitting in that round. Clearly, this also leads to a lower bound on the probability of being chosen for each coordinate 1 ≤ j ≤ p.

⁷ Splitting for that node ceases if Q = 0.

Each of the two methods above involves two separate rounds of randomization: first, a random variable encoding the decision to split randomly is realized (i.e., the coin or the random variable Q), and second, the splitting axis is then determined. Intuitively, our cyclic splitting assumption above forgoes the second randomization step: in doing so, the variance of the number of times that any coordinate is chosen is reduced. Importantly, the variance will depend only on δ and not on p, which we will exploit in our proofs.
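The cyclic randomization in Assumption 2 amounts to a few lines of bookkeeping. The sketch below flips the probability-δ coin once per split and, on heads, forces the next coordinate in the cycle; on tails, the coordinate is left to the data-driven rule (returned as None here). The function name and the 0-indexed coordinates are illustrative choices, not part of the formal assumption.

```python
import numpy as np

def cyclic_split_coordinate(rng, heads_so_far, p, delta):
    """One round of Assumption 2: flip a probability-delta coin.  On heads, force
    the next coordinate in the cycle 0, 1, ..., p-1, 0, ...; on tails, return None
    and let the data-driven rule choose the coordinate.  Returns the coordinate
    (or None) together with the updated heads count."""
    if rng.random() < delta:
        return heads_so_far % p, heads_so_far + 1
    return None, heads_so_far

# Example: the coordinates forced over ten rounds with p = 3 and delta = 0.6.
rng = np.random.default_rng(0)
heads = 0
for _ in range(10):
    j, heads = cyclic_split_coordinate(rng, heads, p=3, delta=0.6)
    print(j)
```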
Assumption 3 (The Splitting Algorithm is (α, k)-Regular). There exists some α ∈ (0, 1/2) such that whenever a split occurs in a node with m sample points, each of the two resulting hyperrectangles contains at least αm points. Moreover, splitting ceases at a node only when the node contains fewer than 2k − 1 points, for some k.

This key assumption is carried over from [12] and contains two requirements. The first requires that no split may produce a halfspace containing too few observations, i.e., that both halfspaces are large when measured by the count of observations. As shown in [14], this implies that with exponentially small complementary probability, the splitting axis shrinks by a factor between α and 1 − α, so that both halfspaces are also large in Euclidean volume (with high probability).

The second half of the assumption places an upper bound on the number of observations in terminal nodes. Trees grown under this assumption will necessarily be deeper as the sample size n (and hence the subsample size s = n^β) increases. In particular, the predictions at leaf nodes—averages of observations Y_i—will be averages of a bounded number of terms. An important consequence is that the variance of the tree estimator (at any test point x) is bounded below (c.f., the distributional assumption on Var(Y | X = x)).

Assumption 4 (Predetermined Splits). The candidate splits considered at each node do not depend on the data {Z_i}, so that they are fixed ahead of time. Furthermore, the number of candidate splits at each node is finite, and every candidate split shrinks the length of its splitting axis by at most a factor α.

The predetermined splitting assumption is specific to our paper. That candidate splits considered at each node are data-independent is "almost" without loss of generality since the feature space X is fixed. For example, implementations of random forests typically use 32-bit floating point numbers as their splitting index, so that the assumption is automatically satisfied. Furthermore, we note that the assumption allows candidate splits to depend on the node, so that the splitting process is still data-driven insofar as the sequence of splits leading up to a node depends on the observations.

One interpretation of this assumption is that it aids in data compression. For example, suppose that all features are continuous and X = [0, 1]^p, and that all candidate splits have the form (j, k/m) for some integers 1 ≤ j ≤ p and 0 ≤ k < m, where m ≥ 2 is a fixed grid resolution. In this case one may store {⌊m X_ij⌋ : 1 ≤ i ≤ n, 1 ≤ j ≤ p} instead of {X_ij}. Each member of the former set is an integer in {0, . . . , m − 1} and thus can be represented using log₂ m bits. In particular, for a grid resolution of 2⁸ = 256—a fine grid even in moderately large dimensions—each coordinate of the feature vectors X_i may be stored in a single byte. Since modern CPUs and graphics processors store floating point numbers in four or eight bytes, this allows for a substantial reduction, allowing computation to scale to larger datasets. The process of encoding in this way is known as quantizing, which is an option supported by many popular software packages. In this way, though the predetermined split assumption may seem at first glance restrictive, it aligns our model more closely with practice.
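A small sketch of the quantization just described: features in [0, 1]^p are mapped to the integer grid {0, . . . , m − 1}, so that with m = 256 each coordinate occupies a single byte and the candidate splits of Assumption 4 are the fixed thresholds k/m. The grid size and data below are illustrative.

```python
import numpy as np

def quantize(X, m=256):
    """Map features in [0, 1]^p to the integer grid {0, ..., m-1}; with m = 256
    each coordinate fits in a single byte.  The grid values k/m then serve as the
    predetermined candidate split points of Assumption 4."""
    Xq = np.minimum(np.floor(X * m), m - 1)    # guard the boundary X_ij = 1
    return Xq.astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 10))
Xq = quantize(X)
print(X.nbytes, "bytes as float64 ->", Xq.nbytes, "bytes as uint8")
```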
Assumption 5 (Distributional Assumptions on the DGP of (X, Y)). The features X_i are supported on the unit cube X = [0, 1]^p with a density that is bounded away from zero and infinity. Furthermore, the functions x ↦ E(Y | X = x), x ↦ E(Y² | X = x), and x ↦ E(Y³ | X = x) are uniformly Lipschitz continuous. Finally, the conditional variance Var(Y | X = x) is bounded away from zero, i.e., inf_{x∈X} Var(Y | X = x) > 0.

The continuity and variance bound assumptions are standard. Note that a consequence of continuity and compactness of the hypercube is that the conditional moments up to order three are bounded. Our results will not explicitly depend on knowledge of the density of X: however, the density will affect the implicit constants that we carry throughout our proofs (c.f., Lemma 3.2 and Theorem 3.3 in [12]).

Gaussianity of Multivariate U-Statistics

We begin our investigation of the random forest estimator in this section. As discussed in the model introduction, the random forest estimator RF(x) at a test point x is a U-statistic whose kernel is the tree estimator T(x) marginalized over external randomizations. This paper studies the multivariate distribution of RF, specifically the correlation structure between RF(x) and RF(x̄) at distinct points x and x̄ ∈ X. Towards that end, we shall fix a collection of q test points

    x_1, . . . , x_q ∈ X    (13)

throughout the remainder of the paper. As these points will remain fixed, for notational brevity we will omit their explicit dependence when writing estimators. Therefore, RF(Z_1, . . . , Z_n) stands for the q-dimensional estimator that is the random forest evaluated at x_1, . . . , x_q, given observations {Z_i : 1 ≤ i ≤ n}. As a consequence of this notation, most of our equations are to be understood in R^q, with equality and arithmetic acting coordinate-wise. Finally, when there is no confusion, subscripts (typically k or ℓ) denote a specific coordinate, i.e., the estimate at the k-th or ℓ-th test point; a notable exception is T_1, which refers to a Hajek projection that we now describe.

We start by reviewing properties of the Hoeffding decomposition of U-statistics, also known as Hajek projections; see [22] for a textbook treatment of the univariate case. Let f(Z_1, . . . , Z_m) ∈ R^q be a generic q-dimensional statistic based on m observations. The Hajek projection of f is defined to be

    ˚f(Z_1, . . . , Z_m) = Σ_{i=1}^m E[f(Z_1, . . . , Z_m) | Z_i] − (m − 1) E f(Z_1, . . . , Z_m).    (14)

That is, it is the coordinate-wise projection of f onto the linear space spanned by functions of the form {g(Z_i) : 1 ≤ i ≤ m}. In particular, when f is symmetric in its arguments and Z_1, . . . , Z_m is an IID sequence, we have

    ˚f(Z_1, . . . , Z_m) = Σ_{i=1}^m f_1(Z_i) − (m − 1) E f,    (15)

where f_1 is the function such that f_1(Z_1) = E(f | Z_1), i.e., f_1(z) = E(f | Z_1 = z).
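The projection formulas (14) and (15) are easy to verify numerically on a toy kernel. The sketch below takes the squared sample mean of s draws from N(1, 1) as the statistic f, uses the closed form of f_1(z) = E(f | Z_1 = z), and checks by Monte Carlo that Var(f − ˚f) is small relative to Var(˚f) (of order 1/s for this non-degenerate kernel); this is exactly the kind of comparison used later in the text. The kernel and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
s, n_mc = 50, 4000

def f1(z):
    """Closed form of f1(z) = E[f | Z_1 = z] for the squared-sample-mean kernel
    f(Z_1, ..., Z_s) = (Z_bar)^2 with Z_i ~ N(1, 1): average E[((z + S)/s)^2]
    over the sum S of the remaining s - 1 draws (E S = s - 1, Var S = s - 1)."""
    ES2 = (s - 1) + (s - 1) ** 2                  # E[S^2] = Var(S) + (E S)^2
    return (z ** 2 + 2.0 * z * (s - 1) + ES2) / s ** 2

Ef = 1.0 + 1.0 / s                                # E[(Z_bar)^2] = (E Z)^2 + Var(Z_bar)

# Draw n_mc independent samples of size s; compute f and its Hajek projection (15).
Z = rng.normal(loc=1.0, size=(n_mc, s))
f_vals = Z.mean(axis=1) ** 2
proj_vals = f1(Z).sum(axis=1) - (s - 1) * Ef      # sum_i f1(Z_i) - (s - 1) E f

# The projection captures essentially all of the variance of f: the ratio is O(1/s).
print(np.var(f_vals - proj_vals) / np.var(proj_vals))
```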
In our setting, applying the Hajek projection to the centered statistic RF − µ, where µ is the expectation of RF, yields

    ˚RF(Z_1, . . . , Z_n) − µ = Σ_{i=1}^n E(RF − µ | Z_i) = (1/C(n, s)) Σ_{i=1}^n E[ Σ_{i_1, . . . , i_s} E_ξ T(ξ; Z_{i_1}, . . . , Z_{i_s}) − µ | Z_i ],    (16)

where i_1, . . . , i_s run through the C(n, s) size-s subsets of {1, . . . , n} as usual. (Recall that RF, µ, and T are all vectors in R^q.) Since the samples Z_1, . . . , Z_n are independent, E(E_ξ T(ξ; Z_{i_1}, . . . , Z_{i_s}) | Z_i) = µ whenever i ∉ {i_1, . . . , i_s}. As {i_1, . . . , i_s} runs over the size-s subsets of {1, . . . , n}, there are exactly C(n−1, s−1) many which contain i. For each of these subsets,

    E(E_ξ T(ξ; Z_{i_1}, . . . , Z_{i_s}) − µ | Z_i) =: T_1(Z_i) − µ,    (17)

where T_1(z) = E_{ξ, Z_2, . . . , Z_s} T(ξ; z, Z_2, . . . , Z_s). Therefore,

    ˚RF − µ = (1/C(n, s)) Σ_{i=1}^n C(n−1, s−1) (T_1(Z_i) − µ) = (s/n) Σ_{i=1}^n (T_1(Z_i) − µ).    (18)

The sequence of observations Z_1, . . . , Z_n is assumed to be IID, and this property is preserved for the sequence {T_1(Z_i) : 1 ≤ i ≤ n} of projections. It is easily verified that E(˚RF) = µ, and the point of the previous equation is that it expresses the centered statistic (˚RF − µ) as an average of centered IID terms, scaled by s. This will be our main entry point in establishing asymptotic joint normality.

The standard technique for deriving the asymptotic distribution of a U-statistic is to establish a lower bound on the variance of its Hajek projection; this is the approach taken by [12] and we follow it here. Let V be the variance of ˚RF; using (18), we have

    V = Var[ (s/n) Σ_{i=1}^n (T_1(Z_i) − µ) ] = (s²/n) Var(T_1(Z_1)) = (s/n) Var[ Σ_{i=1}^s T_1(Z_i) ] = (s/n) Var ˚T ∈ R^{q×q},    (19)

where ˚T is the Hajek projection of the statistic T as in (15), with T = E_ξ T(ξ; Z_1, . . . , Z_s) ∈ R^q. Since the third moment E(Y³ | X = x) is assumed to be bounded, the conditions for the Lindeberg Central Limit Theorem [23] follow easily, and applying the triangular-array CLT, we have the familiar fact

    V^{−1/2} (˚RF − µ) ⇒ N(0, I),    (20)

where 0 is the zero vector in R^q and I is the q × q identity matrix.
Remark. The condition that the third moment is bounded is not necessary. The Lindeberg conditions were directly verified in [12] (c.f., Theorem 8) and their results—without assuming a bounded third moment—apply to our setting as well. More recently, triangular array CLTs specific to U-statistics were developed in [24], and their conditions are satisfied in our case as well.

The asymptotic normality of the random forest estimator RF can be related to the asymptotic normality of ˚RF via

    V^{−1/2}(RF − µ) = V^{−1/2}(RF − ˚RF) + V^{−1/2}(˚RF − µ).    (21)

Since the second summand on the RHS is asymptotically normal, by Slutsky's Theorem, V^{−1/2}(RF − µ) is asymptotically normal once we establish the convergence V^{−1/2}(RF − ˚RF) →^P 0.
The strategy is to show that e = V^{−1/2}(RF − ˚RF) converges to zero in squared mean. We may develop its squared norm via

    E(e⊤e) = E[(RF − ˚RF)⊤ V^{−1} (RF − ˚RF)] = E tr[ V^{−1} (RF − ˚RF)(RF − ˚RF)⊤ ] = tr[ V^{−1} E(RF − ˚RF)(RF − ˚RF)⊤ ] = tr[ V^{−1/2} Var(RF − ˚RF) V^{−1/2} ],    (22)

where we used the identity tr(ABC) = tr(BCA) for conforming matrices A, B, and C.
That the trace on the extreme RHS goes to zero is the natural multivariate generalization of the familiar condition Var(f − ˚f)/Var ˚f → 0 for U-statistics [22].

In the univariate setting, the previous condition is checked by considering higher order decompositions of the statistic f; this approach is also valid in the multivariate setting, as shown below. As we will see, the more substantive difficulty is that the dimension of the kernel of the U-statistic—namely, the subsample size s of each tree—grows with the sample size.

By Proposition 1 below, we may expand RF − ˚RF according to a Hoeffding decomposition taken with respect to the matrix V^{−1}; bounding the variance of the higher-order terms of this expansion reduces the trace in (22) to a comparison of the matrices (Var ˚T)^{−1} and Var T. Specifically, [12] obtains

    (Var T)_{kk} / (Var ˚T)_{kk} ≤ c (log s)^p,  for each k = 1, . . . , q,    (28)

for some constant c. As we will see in the next section, the required bound on the trace will follow by developing bounds on the off-diagonal elements of Var ˚T, i.e., bounds on the covariance between random forest estimates at different test points (see discussion following Proposition 2).

Proposition 1 (Hoeffding Decomposition for Multivariate U-statistics). Fix a positive definite matrix M. Let f(x_1, . . . , x_n) ∈ R^q be a vector-valued function that is symmetric in its arguments and let X_1, . . . , X_n be a random sample such that f(X_1, . . . , X_n) has finite variance. Then there exist functions f_1, f_2, . . . , f_n such that

    f(X_1, . . . , X_n) = E(f) + Σ_{i=1}^n f_1(X_i) + Σ_{i<j} f_2(X_i, X_j) + · · · + f_n(X_1, . . . , X_n),

where the summands are mutually orthogonal with respect to the inner product ⟨X, Y⟩ = E(X⊤ M Y).

The aim of this section is to establish asymptotic bounds on the off-diagonal elements of the covariance matrix Var ˚T. We shall show that when the tree estimator employs splitting algorithms satisfying suitable stability conditions, we have the asymptotic behavior

    (Var ˚T)_{k,l} = o(s^{−ε})  for all 1 ≤ k ≠ l ≤ q and some ε > 0.    (31)

Before proceeding, we first show that this bound, coupled with control on the diagonal terms (28), suffices to establish that the trace bound in (27) vanishes.

Proposition 2. The entries of Var T are bounded and its diagonal entries are bounded away from zero. Furthermore, when Var ˚T satisfies the condition in (31),

    (s/n) tr( (Var ˚T)^{−1} Var T ) → 0.    (32)

Remark. The first part of the Proposition, concerning the entries of Var T, is a consequence of our (α, k)-regularity assumption and the distributional assumptions on {Z_i}. As discussed in the assumption section, since the number of observations in leaf nodes is bounded above, the (pointwise) variance of the tree estimator at x is bounded below by (a constant times) Var(Y | X = x), and we assumed that the latter function is bounded away from zero. That the off-diagonal entries are bounded is a trivial consequence of the fact that E(Y² | X = x) is Lipschitz and thus bounded. The techniques we present to bound Var ˚T could also be used to bound Var T; it is in fact true that (Var T)_{k,l} → 0 for k ≠ l, though we will not pursue this further in this paper.

Proposition 2 establishes ˚T as the central object of study. Recall that T is the tree estimator while ˚T is its Hajek projection; in other words, Var ˚T is not the covariance matrix of tree estimates. However, our result demonstrates the asymptotic normality of V^{−1/2}(RF − µ), where V, the variance of the Hajek projection ˚RF, is given in terms of Var ˚T (c.f., (19)).
Therefore, (a rescaled version of) ˚T is precisely the object needed to conduct inference on the random forest. In particular, combining (28) and (31) yields the fact that Var ˚T—and hence the asymptotic variance of RF—is diagonally dominant (i.e., tending to a diagonal matrix in the limit).

We may always relabel indices so that the tree is grown on the observations Z_1, . . . , Z_s. To establish the bound (31), start with the definition

    ˚T − µ = Σ_{i=1}^s (E(T | Z_i) − µ),  so that  Var ˚T = s Var(E(T | Z_1))  due to independence.    (33)

To develop the term on the RHS, use the orthogonality condition for conditional expectations,

    Var E(T | Z_1) = Var[E(T | Z_1) − E(T | X_1)] + Var[E(T | X_1)].    (34)

Since the tree algorithm is honest, the difference E(T | Z_1) − E(T | X_1) simplifies, so that for each 1 ≤ k ≤ q,

    E(T_k | Z_1) − E(T_k | X_1) = E(I_k | X_1)(Y_1 − E(Y_1 | I_k = 1, X_1)),    (35)

where T_k is the tree estimate at x_k, and I_k the indicator for whether X_1 and x_k belong to the same terminal node. Therefore, the off-diagonal entry at (k, l) of Var[E(T | Z_1) − E(T | X_1)] is equal to

    E[ E(I_k | X_1) E(I_l | X_1)(Y_1 − E(Y_1 | X_1, I_k = 1))(Y_1 − E(Y_1 | X_1, I_l = 1)) ].    (36)

If we expand the terms in the integrand, every term will have the shape

    E(I_k | X_1) E(I_l | X_1) · p(Y_1, E(Y_1 | X_1, I_k = 1), E(Y_1 | X_1, I_l = 1))    (37)

for some multinomial p of degree at most two. Since we have assumed that E(Y | X = x) and E(Y² | X = x) are continuous and hence bounded, E(p | X_1 = x) is also bounded. Using the Law of Iterated Expectations to evaluate (36) then shows that it is bounded by a constant times

    E[ E(I_k | X_1) E(I_l | X_1) ].    (38)

Remark. A direct application of the Cauchy-Schwarz inequality, using only the fact that E(Y² | X = x) is bounded, would yield the weaker bound

    √( E[ E(I_k | X_1)² E(I_l | X_1)² ] ) ≤ √( E[ E(I_k | X_1) E(I_l | X_1) ] )    (39)

up to a multiplicative constant.

Recall that I_k and I_l are indicator variables for whether X_1 belongs to the same hyperrectangle as x_k and x_l, respectively. Therefore, E(I_k | X_1) is the probability that the first observation is used for the prediction at x_k, and likewise for E(I_l | X_1). Intuitively, this only happens when X_1 is near x_k (respectively, x_l): since x_k ≠ x_l, X_1 cannot be near both, meaning that the product E(I_k | X_1) E(I_l | X_1) is small.

Proposition 3. For two points x and x̄ ∈ X = [0, 1]^p, define

    M(x, x̄) = E[ E(I | X_1) E(Ī | X_1) ],    (40)

where I and Ī are indicators for X_1 belonging to the same terminal node as x and x̄, respectively. If δ > 1/2 and x ≠ x̄,

    M(x, x̄) = o(s^{−(1+ε)})  for some ε > 0.    (41)

Remark. It is instructive to compare the bound in the preceding display with M(x, x). It is clear from the definition that M(x, x) ≥ M(x, x̄) for all x̄. In addition, M(x, x) = E(E(I | X_1)²) ≤ E(E(I | X_1)) = E(I). By symmetry, E I = 1/s (up to a constant), as the terminal node at x has a bounded number of observations. Therefore, all that the Proposition ensures is that when x ≠ x̄, the quantity M(x, x̄) is smaller than the "trivial" bound 1/s.

This proposition shows that the contribution of Var[E(T | Z_1) − E(T | X_1)] to the cross covariances of Var E(T | Z_1) is small, in particular smaller than the required bound (log^p s · s)^{−1}.
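The quantity M(x, x̄) can also be estimated by brute-force simulation, which is a useful diagnostic for a given splitting algorithm and data-generating process. The sketch below draws X_1, grows many small trees on subsamples containing X_1, estimates the two conditional co-occurrence probabilities, and averages their product; scikit-learn's DecisionTreeRegressor and the simple DGP are stand-ins, so the numbers it produces illustrate the decay of M(x, x̄) rather than the constants appearing in Proposition 3.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def estimate_M(x, x_bar, s=200, n_outer=100, n_inner=50, p=2, seed=0):
    """Monte Carlo estimate of M(x, x_bar) = E[ E(I | X_1) E(I_bar | X_1) ].
    Outer loop: draw X_1.  Inner loop: grow trees on subsamples containing X_1
    and estimate the conditional probabilities that X_1 shares a terminal node
    with x and with x_bar.  The DGP and tree settings are illustrative."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float).reshape(1, -1)
    x_bar = np.asarray(x_bar, float).reshape(1, -1)
    total = 0.0
    for _ in range(n_outer):
        x1 = rng.uniform(size=(1, p))                        # a draw of X_1
        q, q_bar = 0.0, 0.0
        for _ in range(n_inner):
            X = np.vstack([x1, rng.uniform(size=(s - 1, p))])
            Y = X.sum(axis=1) + rng.normal(scale=0.1, size=s)
            tree = DecisionTreeRegressor(min_samples_leaf=2).fit(X, Y)
            leaf_x1 = tree.apply(x1)[0]
            q += float(leaf_x1 == tree.apply(x)[0])
            q_bar += float(leaf_x1 == tree.apply(x_bar)[0])
        total += (q / n_inner) * (q_bar / n_inner)
    return total / n_outer

# M(x, x) is of order 1/s, while M(x, x_bar) for separated points should be far smaller.
print(estimate_M([0.3, 0.3], [0.3, 0.3]), estimate_M([0.3, 0.3], [0.7, 0.7]))
```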
The requirement that δ > 1/2, while needed for the proof to go through, is almost certainly not needed in practice. The reason is that our proof uses δ > 1/2 to obtain a uniform bound on the quantity

    E(I_k | X_1) E(I_l | X_1),    (42)

while the proposition demands a bound only on its expectation. Indeed, in the extreme case x = 0 and x̄ = (1, . . . , 1)⊤, it is easy to see that the expectation satisfies the required bound even when δ ≤ 1/2, and in general we expect that M(x, x̄) will be smaller than that predicted by (41). In light of this, an alternative to our cyclic splitting assumption is to assume the high level condition that the splitting algorithm and data generating process confer the bound

    M(x, x̄) = o( (log^p s · s)^{−1} ).    (43)

3.4.1 Bounding Var E(T | X_1)

We turn next to bounding the off-diagonal terms in Var[E(T | X_1)]. As in the statement of Proposition 3, it will be convenient to slightly change notation. We fix 1 ≤ k ≠ l ≤ q, write x for x_k and x̄ for x_l, and let x_1 denote the value of X_1. The goal of this section is to establish the bound

    Var[E(T | X_1)]_{kl} = E[ (E(T | X_1 = x_1) − µ)(E(T̄ | X_1 = x_1) − µ̄) ] = o( (log^p s · s)^{−1} ),    (44)

where T and T̄ are the tree estimates at x and x̄, and µ and µ̄ are the (unconditional) expectations of T and T̄. (Note that we have also changed the notation of T slightly; before, T was the q-dimensional vector of the estimates at all test points; for this section, it is the pointwise estimate at x = x_k.)

The quantity E(T | X_1 = x_1) − µ = E(T | X_1 = x_1) − E(T) measures the degree of "information" that the location of a single observation X_1 carries for the output of the tree at x. Intuitively, when x_1 is near x, the effect of x_1 on the leaf node containing x is more pronounced and we expect E(T | X_1 = x_1) − µ ≈ E(Y | X = x_1) − µ. Conversely, when x_1 is far from x, its effect in determining the location of the leaf node containing x diminishes, and E(T | X_1 = x_1) − µ ≈ 0. The key object in our analysis is the point at which the observation X_1 = x_1 leaves the intermediate partition containing x, where "intermediate partition" refers to the nodes created during the splitting process that are not necessarily terminal. We will see that once X_1 and x are separated, the effect of X_1 on the prediction decreases.

Towards this end, fix x_1 and let Π denote the terminal node containing x; Π is a random subset of X created by axis-aligned splits. By Assumption 4, the set of potential splits does not depend on the sample (in particular, it does not depend on X_1). Moreover, splitting ceases after no more than s splits, regardless of the subsample X_1, . . . , X_s, as each split reduces the number of observations in its two children nodes by at least one. Therefore, Π takes on only finitely many possible values, and we may write

    E(T) = Σ_π P(Π = π) µ_π  and  E(T | X_1 = x_1) = Σ_π P(Π = π | X_1 = x_1) µ′_π,    (45)

where µ_π = E(T | Π = π) and µ′_π = E(T | Π = π, X_1 = x_1).

The hyperrectangle Π is determined by the recursive splitting procedure used to grow the tree, and there is a natural correspondence between (45) and a certain "expectation" taken over the directed acyclic graph (DAG) defined in the following way. Let [0, 1]^p be the root of the DAG; for every potential split at [0, 1]^p, there is a directed edge to a new vertex, where the vertex is the hyperrectangle that contains x.
If the node represented by a vertex is one of the possible values of Π, then that vertex is a leaf in the DAG and has no outgoing edges; other vertices carry an outgoing edge for each potential split at that node, with each edge going to another vertex which is again a hyperrectangle containing x.

The previous definition determines the DAG recursively: each vertex in the DAG is a node containing x, with terminal vertices corresponding to terminal nodes. To each terminal vertex v, we associate the value f(v) := µ_v as in (45). In addition, each edge e = (v → w) corresponds to a split at a node v producing a halfspace w of v; associate with this edge a "transition probability"

    p(e) := P(the split producing w is chosen at v | current node is v) =: P(w | v).    (46)

Given the transition probabilities, the value f may be extended to each vertex v recursively via the formula

    f(v) := Σ_{e : v → w} P(w | v) f(w).    (47)

We refer to f as the continuation value at v, and by construction we have

    E(T) = f("root") = f([0, 1]^p).    (48)

Alternatively, if we had assigned the values f′(v) = µ′_v to each terminal vertex and used the transition probabilities

    p′(e) = P(the split producing w is chosen at v | current node is v, X_1 = x_1) =: P′(w | v),    (49)

then we recover E(T | X_1 = x_1) = f′([0, 1]^p) after extending f′ in the same way as f. In other words, bounding E(T | X_1 = x_1) − E(T) requires bounding the difference between the two types of continuation values.

We will need to assume that p′(e) ≈ p(e); that is, conditioning on a single observation does not greatly affect the probability that a particular split is chosen (i.e., that a particular split is optimal at its node). This is a natural assumption in that the optimal split is computed using all the observations in a particular node, so that conditioning on a single observation should have relatively little effect. Whether this holds depends on the specifics of the splitting algorithm used to construct a tree, and we call rules which satisfy the following assumption "stable splitting rules."

Assumption 6 (Splitting Stability). For any node v, the total variation distance between the distributions {p(e)}_{e : v → w} and {p′(e)}_{e : v → w} is bounded in terms of the volume of v. Specifically, there exists some ε > 0 such that for all v,

    TV(p, p′) ≤ (s |v|)^{−(1+ε)}  (up to a constant).    (50)

Here, |v| denotes the volume of the hyperrectangle at v, i.e.,

    |v| = | Π_{j=1}^p (a_j, b_j) | = Π_{j=1}^p |b_j − a_j|.    (51)

Remark. The distributions p and p′ are discrete: thus, if p and p′ are written as vectors of probability masses, then the total variation distance is the L¹ norm between the two vectors. Since the distribution of X has a density that is bounded above and below, a simple Hoeffding bound shows that the number of sample points in v is bounded above and below by a constant multiple of s|v|, with the constants adjusted so that the failure probability is less than 1/s².⁸ Since this is smaller than the required bound (log^p s · s)^{−1}, we may interpret s|v| in (50) to be the number of samples in v without loss of generality.
Relatedly, Lemma 12 of [14] (see also the proof of Lemma 2 in [12]) extends this fact to be uniform across nodes.

⁸ For example, the probability that a binomial random variable B(n, p) deviates from np by more than √(n log n) is less than C/s² for some constant C.

The stability assumption places a restriction on the procedure used to select optimal splits: namely, if the decision is made on the basis of m points, then conditioning on any one of the points changes the optimal split with probability bounded by m^{−(1+ε)}. In practice, most splitting procedures satisfy a stronger bound. A set of sufficient conditions is given in the following proposition.

Proposition 4. Assume that the optimal split at a node v is chosen based on the quantities

    f_1(µ_1, . . . , µ_Q), . . . , f_P(µ_1, . . . , µ_Q)    (52)

for some Q ≥ 1, where µ_1, . . . , µ_Q are sample averages over the points being split,

    µ_k = (1/n_v) Σ_{i : X_i ∈ v} m_k(X_i),    (53)

where the sum runs over points in v, and n_v denotes the number of these points. Specifically, suppose the optimal split is decided based on which f_i achieves the largest value, i.e., the value arg max_i f_i(µ). If f_1, . . . , f_P are Lipschitz, and the functions m_1, . . . , m_Q are such that m_k(X) is 1-subExponential, then the splitting stability assumption is satisfied.

Remark. Since the X_i are bounded, the requirement that m_k(X) is subExponential allows the use of (8) to compute the optimal split.

In general, the conditions in Proposition 4 are sufficient to guarantee an exponential bound instead of a polynomial one as in (50). Thus, Proposition 4 should be viewed simply as providing a plausibility argument that stable splitting rules are commonly encountered in practice.

The next proposition shows that bounds on splitting probabilities automatically imply a related bound on the continuation values.

Proposition 5. Suppose the splitting probabilities satisfy a generic bound ∆(·) in that

    TV(p, p′) ≤ ∆(s |v|) / log s  at each node v.    (54)

For example, ∆(z) = z^{−(1+ε)}. Then for any node v containing x but not x_1,

    |f(v) − f′(v)| ≤ C ∆(s |v|)    (55)

for some constant C not depending on v.

In what follows we take ∆(z) = z^{−(1+ε)}, where the factor ε allows us to ignore the extra logarithm. In that case, we may put the bounds on TV(p, p′) and |f − f′| together and establish the required bound on Var E(T | X_1).

Proposition 6. Suppose that the splitting rule is stable as in (50) and that δ > 1 − α. For x_1 ≠ x,

    |E(T | X_1 = x_1) − E(T)| = o(1/s^{1+ε})    (56)

for some ε > 0. In particular, the off-diagonal entries of Var E(T | X_1) are o(s^{−(1+ε)}), as at least one of x and x̄ is distinct from x_1.

Recall that Proposition 3 required that δ > 1/2. Since α < 1/2, the requirement δ > 1 − α in Proposition 6 is more restrictive. Just like Proposition 3, we argue that this requirement is plausibly looser in applications. The reason is that it is used to give the following bound on hyperrectangles v created after L splits:

    |v| ≥ α^L.    (57)

The RHS appears since a potential split may reduce the volume of a node by at most a factor α: but only an exponentially small (i.e., 2^{−L}) proportion of nodes is the result of taking the smallest possible split L times! The "average" node has volume approximately (1/2)^L, so that δ > 1/2 should suffice in practice.

Combining Propositions 3 and 6 with equations (33) and (34) yields the desired bound (31) on the off-diagonal terms of Var ˚T discussed at the beginning of this section.
Therefore, Proposition 2 applies, and the joint normality of the random forest estimator is established.

The previous sections focused on deriving the asymptotic normality result

    V^{−1/2}(RF − µ) ⇒ N(0, I),  where V = Var ˚RF = (s/n) Var ˚T.    (58)

Recall our standing convention that RF ∈ R^q is the random forest estimate at x_1, . . . , x_q and µ is its expectation. According to (9), the target function of the random forest is actually m(x) = E(Y | X = x). The results in [12] show that (RF(x) − m(x))/√V ⇒ N(0, 1) pointwise for each x ∈ X; since we have shown that V is diagonally dominant in that its off-diagonal terms vanish relative to the diagonal, the pointwise result carries over to our multivariate setting, and V^{−1/2}(RF − m) ⇒ N(0, I), where m = (m(x_1), . . . , m(x_q)).

Moreover, [12] proposes a jackknife estimator that can consistently estimate √V in the scalar case. Our diagonal dominance result implies that the random forest estimates at x and x̄ ∈ [0, 1]^p are independent in the limit n → ∞,

    Var(RF(x) + RF(x̄)) = Var(RF(x)) + Var(RF(x̄)) + 2 Cov(RF(x), RF(x̄)) ≈ Var(RF(x)) + Var(RF(x̄)),    (59)

so that the jackknife estimator for the scalar case may be fruitfully applied to obtain confidence bands for functionals of the random forest estimates (i.e., expressions involving the estimate at more than one point).

The accuracy of the approximation above depends on the decay of the off-diagonal terms in finite samples. In this section, we provide a "back of the envelope" bound for the covariance term that may be useful for practitioners. We stress that the following calculations are (mostly) heuristics: as we have shown above, the covariance term depends on quantities such as M(x, x̄), which is in turn heavily dependent on the exact mechanics of the underlying splitting algorithm. Since our aim is to produce a "usable" result, we will dispense with rigorous analysis in the remainder of this section.

To begin, the proofs of Propositions 3 and 6 showed that the asymptotic variance V has off-diagonal terms which are upper bounded by⁹

    M(x, x̄) + (log s) ( Σ_{ℓ=0}^∞ p_ℓ )( Σ_{ℓ=0}^∞ p̄_ℓ ) = M(x, x̄) + (log s) E(L) E(L̄),    (60)

where p_ℓ = P(L ≥ ℓ) is the probability that x_1 and x are not separated after ℓ splits, and likewise for p̄_ℓ = P(L̄ ≥ ℓ). That is, L is the number of splits before which x_1 and x belong to the same partition. If we denote by I (resp. Ī) the indicator variable that X_1 is in the terminal node of x (resp. x̄), then the events {I = 1} and {L = log s} are equal, so that

    E(I | X_1 = x_1) = P(L = log s) ≤ E L / log s.    (61)

Replacing the inequality with an approximation, we have E L = (log s) E(I | X_1 = x_1). All of this shows that the covariance term is bounded by

    (log s) E(I | X_1 = x_1) E(Ī | X_1 = x_1) ≈ (log s) M(x, x̄).    (62)

⁹ This is a very crude upper bound, as we have dropped the quantity ∆(α^ℓ s) from the infinite series.

Remark. Taken loosely, this heuristic says that the random forest estimator RF, considered as a function on the domain X, is asymptotically Gaussian with covariance process (log s) · M(x, x̄). We stress that this is not implied by our theoretical results, as there we kept the number q of test points fixed.

Towards a useful heuristic, we will consider a bound on the correlation instead of the covariance.
In our notation, the result of [12] lower bounds M(x, x) (and M(x̄, x̄)), while our paper provides an upper bound on M(x, x̄). Ignoring the logarithmic terms, we have

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | ≈ M(x, x̄) / √(M(x, x) M(x̄, x̄)).    (63)

Recall that M(x, x̄) = E[E(I | X_1) E(Ī | X_1)], which decays as x̄ moves away from x. Using the previous expression (note that M(x, x) ≈ M(x̄, x̄) due to symmetry between x and x̄), we can bound the correlation from purely geometric considerations. Since the integrand

    E(I | X_1) E(Ī | X_1)    (64)

decays as X_1 moves away from x (and x̄), we may imagine that its integral

    M(x, x) = ∫ E(I | X_1 = x_1)² dx_1    (65)

receives its largest contributions from points x_1 near x, say those points in an L∞-box of side length d with volume d^p, i.e., {y ∈ [0, 1]^p : ‖x − y‖∞ ≤ d/2}. If we accept this, then the contributions to the integral M(x, x̄) would come from points that are within d/2 of both x and x̄, and to a first degree approximation, the volume of the set of such points {y ∈ [0, 1]^p : ‖x − y‖∞ ≤ d/2, ‖x̄ − y‖∞ ≤ d/2} is

    (d − z_1) · · · (d − z_p) ≈ d^p − (z_1 + · · · + z_p) d^{p−1},  where z_j = |x_j − x̄_j|,    (66)

where the approximation is accurate if |z_j| ≪ 1. Dividing through by d^p, the proportion of the volume of the latter set is 1 − (1/d)‖x − x̄‖_1, which leads to the heuristic

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | ≈ 1 − c ‖x − x̄‖_1,  for some constant c.    (67)

The RHS has the correct scaling when x = x̄, i.e., the correlation equals one when ‖x − x̄‖_1 = 0. To maintain correct scaling at the other extreme with ‖x − x̄‖_1 = p, we should take c = 1/p, so that

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | = 1 − (1/p) Σ_{i=1}^p |x_i − x̄_i|.    (68)

Of course, this heuristic must be incorrect in that it does not depend on s; our theoretical results show that even for non-diametrically opposed points, the correlation drops to zero as s → ∞. Therefore, another recommendation is to use

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | = max( 1 − (s^ε/p) Σ_{i=1}^p |x_i − x̄_i|, 0 ),    (69)

for some ε > 0, where the dependence on s^ε comes from considering the decay of M(x, x̄) as x̄ moves away from x (c.f. the proof of Proposition 3).
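For practitioners, the rule of thumb (69) is simple enough to evaluate directly; the sketch below does so for a pair of test points, clipping the result to [0, 1]. The exponent eps is a user-chosen constant (the theory only guarantees the existence of some ε > 0), so the output should be treated as a rough conservative guess rather than a calibrated correlation estimate.

```python
import numpy as np

def correlation_heuristic(x, x_bar, s, eps=0.1):
    """Rule-of-thumb bound (69) on |Corr(RF(x), RF(x_bar))|:
    one minus a scaled L1 distance, clipped to the interval [0, 1]."""
    x, x_bar = np.asarray(x, float), np.asarray(x_bar, float)
    p = len(x)
    l1 = np.abs(x - x_bar).sum()
    return float(np.clip(1.0 - (s ** eps / p) * l1, 0.0, 1.0))

# The bound shrinks as the points separate and (slowly) as the subsample size grows.
for s in (100, 1000, 10000):
    print(s, correlation_heuristic([0.2, 0.5], [0.4, 0.5], s))
```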
In this section, we discuss the results of simulations calculating the correlation structure. For our experiment, we set p = 2, so that the covariates X are distributed on the unit square. The distribution of X is chosen to be "four-modal",

    X ∼ ¯N(µ_j, I)  with probability 1/4 for each j = 1, 2, 3, 4,    (70)

where the four modes µ_1, . . . , µ_4 lie in the interior of the unit square and ¯N denotes a truncated multivariate Gaussian distribution on the unit square.¹⁰ Thus, X has a bounded density on the unit square, with four peaks at µ_1, . . . , µ_4. The distribution of Y conditional on X = (x_1, x_2) is

    Y ∼ x_1 + x_2 + ε,  ε ∼ N(0, σ²),    (71)

with a small noise variance σ². The random splitting probability is δ = 1/2, and the regularity parameters are α = 0.01 and k = 1, so that the tree is grown to the fullest extent (i.e., terminal nodes may contain a single observation), with each terminal node lying on the 101 × 101 grid of the unit square. For each sample size n, five thousand trees are grown, and the estimates are aggregated to compute the correlation.

¹⁰ That is, ¯N(µ, Σ) denotes the conditional distribution of x ∼ N(µ, Σ) on the event x ∈ [0, 1]².

Figure 1 plots the correlation of estimates at x and x̄ as a function of the L¹ norm ‖x − x̄‖_1. The calculation is performed by first fixing x, then calculating the sample correlation (across five thousand trees) as x̄ ranges over each cell: the correlation is associated with the L¹ norm ‖x − x̄‖_1. This process is then repeated by varying the reference point x, and the correlation at ‖x − x̄‖_1 is the average of the correlations observed. The figure demonstrates that the linear heuristic (69) given in the previous section is conservative: it is evident that correlation decreases super-linearly as x and x̄ become separated.

Figure 1: Correlation as a function of sample size and L¹ norm.

Figure 2 plots the correlation on a logarithmic scale, which shows that the correlation decay is exponential in a neighborhood of unity. In other words, simulations suggest that the correct heuristic is of the shape

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | ≈ e^{−λ ‖x − x̄‖_1}    (72)

for a suitable λ.

Figure 2: Logarithm of correlation as a function of sample size and L¹ norm.
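The data-generating process of the simulation can be reproduced in outline as follows; the mode locations, the component scale, and the noise level in the sketch are illustrative assumptions rather than the exact values used in the experiment, and the truncation to the unit square is handled by rejection sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
MODES = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])  # assumed mode locations

def draw_X(n, scale=1.0):
    """Draw from a four-modal mixture as in (70): pick a mode uniformly at random,
    then draw a Gaussian around it truncated to the unit square (by rejection)."""
    out = np.empty((n, 2))
    for i in range(n):
        while True:
            x = rng.normal(loc=MODES[rng.integers(4)], scale=scale)
            if np.all((x >= 0.0) & (x <= 1.0)):
                out[i] = x
                break
    return out

def draw_Y(X, sigma=0.1):
    """Response as in (71): Y = x_1 + x_2 + Gaussian noise (noise level assumed)."""
    return X[:, 0] + X[:, 1] + rng.normal(scale=sigma, size=len(X))

X = draw_X(1000)
Y = draw_Y(X)
```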
Random forests and tree-based methods form an important part of the applied data analyst's toolkit. In this paper, we study the covariance between random forest estimates at several points. We develop a novel construction of a directed acyclic graph that keeps track of the splitting probabilities when the value of one data point is known (Propositions 5 and 6). As part of the proof, we establish stability properties for a class of splitting rules (see Proposition 4). We also identify (Proposition 3) the quantity $M(x, \bar x)$, which (roughly) captures the likelihood of two points belonging to the same terminal node, as a key quantity controlling the off-diagonal terms of the covariance matrix of the multivariate random forest.

In this way, this paper provides a theoretical basis for performing inference on functionals of the target function (e.g., a heterogeneous treatment effect) when the functional is based on values of the target function at multiple points in the feature space. Specifically, we show that the covariance vanishes in the limit relative to the variance, and we provide heuristics on the size of the correlation in finite samples.

We close by discussing a couple of avenues for future research. The first is extending our framework to cover categorical or discrete-valued features. Here, new assumptions would be required in order to maintain the guarantee that node sizes are "not too small." Second, our bounds on the covariance matrix of the random forest, after potential improvements, may be used with the recent results of [16, 17] in order to provide finite-sample Gaussian approximations. This would provide a sounder theoretical underpinning for our heuristics, and increase the usefulness of this paper for practitioners.

A Proofs

Proof of Proposition 1. For random vectors in $\mathbb{R}^q$, define the inner product
$$\langle X, Y \rangle := E(X^\top M Y). \qquad (73)$$
For each subset $A \subseteq \{1, \ldots, n\}$, let $H_A$ be the set of square-integrable random vectors of the form $g(X_i : i \in A)$, where $g$ is a function of $|A|$ arguments, satisfying the condition that
$$E\big(g(X_i : i \in A) \mid \{X_i : i \in B\}\big) = 0 \qquad (74)$$
for all subsets $B \subsetneq A$. It is easy to see that the collections $H_A$ are pairwise orthogonal as $A$ ranges over the subsets of $\{1, \ldots, n\}$. By induction on $r = |A|$, the direct sum $\bigoplus_{B \subseteq A} H_B$ is equal to the set of all statistics which are functions of $\{X_i : i \in A\}$. In particular, $\bigoplus_A H_A$ is the set of all statistics based on $\{X_1, \ldots, X_n\}$. When the variables $\{X_1, \ldots, X_n\}$ are IID, $H_A$ depends only on $|A|$, in the sense that there exist collections of functions $H_1, \ldots, H_n$, where $H_k$ is a collection of $k$-ary functions, such that
$$H_A = \{ g(X_i : i \in A) : g \in H_{|A|} \}. \qquad (75)$$
The proof is complete by letting $f_k$ be the projection of $f$ onto $H_k$ according to the inner product given in (73).

Proof of Proposition 2. We will prove the slightly more general statement that if $A_n$ and $B_n$ are two sequences of square matrices with bounded entries such that $B_{ii} > \delta$ for some $\delta > 0$ and all $n$,
$$A_{ii} \ge B_{ii} \log n, \qquad (76)$$
and $A_{ij} = o(1/\log n)$ for $i \ne j$, then $\operatorname{tr}(A^{-1}B) \to 0$. To prove this, start with the determinant formula $\det A = \sum_\pi \operatorname{sgn}(\pi) \prod_{i=1}^q a_{i\pi_i}$, where the sum runs over permutations $\pi$ of $\{1, \ldots, q\}$ and $\operatorname{sgn}(\pi)$ is the sign of the permutation. Since the off-diagonal entries $A_{ij}$ are assumed to vanish relative to the $A_{ii}$, we have $|\det A| \sim \prod_i A_{ii}$, where the notation $a \sim b$ stands for $c|b| \le |a| \le c'|b|$ for constants $c$ and $c'$ not depending on $n$. Next, recall Cramer's rule,
$$(A^{-1})_{ii} = \det A_{-i} / \det A, \qquad (77)$$
where $A_{-i}$ is the matrix $A$ with its $i$-th row and $i$-th column removed. A similar argument shows that $|\det A_{-i}| \sim \prod_{j \ne i} A_{jj}$, whence
$$(A^{-1})_{ii} \sim 1/A_{ii}. \qquad (78)$$
In particular, the $i$-th diagonal entry of the matrix $A^{-1}B$ is given by
$$(A^{-1}B)_{ii} = (A^{-1})_{ii} B_{ii} + \sum_{j \ne i} (A^{-1})_{ij} B_{ji} \sim \frac{B_{ii}}{A_{ii}} \le \frac{1}{\log n}, \qquad (79)$$
where the final relation is due to the fact that $(A^{-1})_{ij}$ is itself a polynomial in the entries of $A$ (viz., an entry of the cofactor matrix of $A$) divided by the determinant. Therefore, the trace of $A^{-1}B$ is of order $1/\log n$, since the dimension $q \times q$ of each matrix is fixed. Using the subsample size $s = n^\beta$, so that $s/n = n^{-(1-\beta)}$, completes the proof.
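As a quick sanity check on the linear-algebra step above (not part of the paper), the snippet below builds matrices whose diagonal entries grow like $B_{ii}\log n$ and whose off-diagonal entries shrink like $1/(\log n)^2$, and verifies numerically that $\operatorname{tr}(A^{-1}B)$ is of order $q/\log n$; the dimension and the particular $o(1/\log n)$ rate are arbitrary choices.

```python
# Numerical illustration of tr(A^{-1} B) -> 0 when A_ii ~ B_ii * log(n) and A_ij = o(1/log n).
import numpy as np

q = 5
rng = np.random.default_rng(1)
B = np.eye(q) + 0.3 * rng.uniform(size=(q, q))        # bounded entries, B_ii > delta
for n in [10**2, 10**4, 10**8, 10**16]:
    logn = np.log(n)
    # Diagonal of order B_ii * log n, off-diagonal of order 1/(log n)^2.
    A = np.diag(np.diag(B) * logn) + (1.0 / logn**2) * rng.uniform(size=(q, q))
    trace = np.trace(np.linalg.solve(A, B))           # tr(A^{-1} B)
    print(f"log n = {logn:6.1f}:  tr(A^-1 B) = {trace:.4f}   (q/log n = {q/logn:.4f})")
```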
Proof of Proposition 3. Recall that the splitting algorithm has a probability $\delta$ chance of splitting on the $j$-th axis. Since each terminal node contains a constant number of points, the number of terminal nodes is equal (up to a constant) to the subsample size $s$. Therefore, the number of splits required to reach a terminal node is bounded (up to a constant) by $\log_2(s/K)$, where $K = 2k - 1$. Since $x \ne \bar x$, we have
$$0 < \|x - \bar x\|_\infty \le \|x - x_1\|_\infty + \|\bar x - x_1\|_\infty \qquad (80)$$
for all $x_1 \in \mathcal{X}$. In particular, given any $x_1$ there exist some $j \in \{1, \ldots, p\}$ and a constant $\beta$ for which either $|x_j - x_{1,j}| > \beta$ or $|\bar x_j - x_{1,j}| > \beta$. Without loss of generality, we may assume that the former case holds. Certainly, a necessary condition for $X_1 = x_1$ to belong to the same leaf node as $x$ (i.e., a necessary condition for $\{I = 1\}$) is that the length of the $j$-th axis of that leaf node be larger than $\beta$.

Let $c_j(x)$ denote the number of splits in coordinate $j$ along the sequence of splits leading to the terminal node containing $x$. By our randomization assumption, each split has at least an independent chance $\delta$ of being chosen at random, and since we cycle through each coordinate (c.f. Assumption 2),
$$c_j(x) \succeq \frac{1}{p}\, B\!\left(\log_2 \frac{s}{K},\ \delta\right), \qquad \text{where } \succeq \text{ stands for stochastic dominance.} \qquad (81)$$
Per Assumption 4, each split along the $j$-th axis decreases its length by a factor of at least $(1-\alpha)$. Since splitting begins in the unit hypercube,
$$(1-\alpha)^{c_j(x)} \ge \beta \implies c_j(x) \le \frac{\log \beta}{\log(1-\alpha)} =: \rho. \qquad (82)$$
Since $\{I = 1\}$ requires the length of the $j$-th axis to exceed $\beta$ (a constant), this proves
$$E(I \mid X_1 = x_1) \le P\!\left[ B\!\left(\log_2 \frac{s}{K},\ \delta\right) \le p\rho \right]. \qquad (83)$$
Since $p\rho$ is a constant, we may conclude
$$P\!\left[ B\!\left(\log_2 \frac{s}{K},\ \delta\right) \le p\rho \right] \le \big(1 - \delta + o(1)\big)^{\log_2(s/K)} = \left(\frac{1}{s}\right)^{\log_2 \frac{1}{1-\delta} + o(1)}. \qquad (84)$$
Finally, recall that the base of the logarithm is two since the tree is binary. Therefore, if we choose $\delta > 1/2$, the exponent exceeds 1 and the proof is complete.
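For completeness, the change of base behind (84), and the resulting requirement on $\delta$, can be spelled out as follows:
$$(1-\delta)^{\log_2(s/K)} = 2^{\log_2(1-\delta)\,\log_2(s/K)} = \left(\frac{s}{K}\right)^{\log_2(1-\delta)} = \left(\frac{K}{s}\right)^{\log_2\frac{1}{1-\delta}}, \qquad \log_2\frac{1}{1-\delta} > 1 \iff \delta > \frac{1}{2}.$$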
Proof of Proposition 4. The easiest case is the splitting decision in the root node $[0,1]^p$, so we start there. We prove the result by introducing a coupling between the split decisions with and without conditioning on $X_1 = x$:
$$S = \arg\max_i f_i\!\left( \frac{1}{s}\sum_{\ell=1}^s m_1(X_\ell), \ldots, \frac{1}{s}\sum_{\ell=1}^s m_Q(X_\ell) \right), \qquad S' = \arg\max_i f_i\!\left( \frac{1}{s}\Big[m_1(x) + \sum_{\ell=2}^s m_1(X_\ell)\Big], \ldots, \frac{1}{s}\Big[m_Q(x) + \sum_{\ell=2}^s m_Q(X_\ell)\Big] \right), \qquad (85)$$
where we abbreviate the evaluated criteria by $f_i$ and $f'_i$ respectively. Here, $S$ is the split made on the sample $X_1, \ldots, X_s$ and $S'$ is the split made on the sample conditional on $X_1 = x$. Note that we may assume without loss of generality that the splits are not randomly chosen, since on that event the splitting probabilities are trivially equal. Clearly, a necessary condition for $S \ne S'$ is the existence of a pair $1 \le i \ne j \le P$ for which
$$f_i > f_j \quad\text{but}\quad f'_j > f'_i. \qquad (86)$$
Since $f$ is Lipschitz and its arguments are sub-exponential by assumption, the quantities $f_i$ and $f_j$ concentrate around their respective limits $f_i(E m_1, \ldots, E m_Q)$ and $f_j(E m_1, \ldots, E m_Q)$; hence, whenever $f_i(E m_1, \ldots, E m_Q) > f_j(E m_1, \ldots, E m_Q)$ we will have
$$f_i - f_j > C/s \quad\text{with probability at least } 1 - O(e^{-cs}) \text{ for some constants } c, C. \qquad (87)$$
However, the corresponding arguments of $f_i$ in the two expressions of (85) differ by at most a term of order $1/s$, the difference being $|m_q(x) - m_q(X_1)|/s$ in the $q$-th argument. By Lipschitz continuity, a change of order $1/s$ in the arguments changes the function values by a proportional amount, so $f'_j > f'_i$ is impossible when $f_i - f_j > C/s$. It follows that the event $(f_i > f_j,\ f'_j > f'_i)$ occurs with probability at most $O(e^{-cs})$, and we finish by taking a union bound over the $\binom{P}{2}$ pairs $(i, j)$. This result for $[0,1]^p$ will be referred to as the base case.

Note that the above actually proves something stronger, namely that for every split $\tau$,
$$P(S \ne \tau \mid S' = \tau) < e^{-cs} \quad\text{and}\quad P(S' \ne \tau \mid S = \tau) < e^{-cs}. \qquad (88)$$
It follows that for any $\tau$, the total variation distance between $(X_1, \ldots, X_s \mid S = \tau)$ and $(X_1, \ldots, X_s \mid S' = \tau)$ is at most $2e^{-cs}$. To see this, note that $S$ and $S'$ are functions of the $X_i$ only, so that the densities of these two distributions are
$$p_\tau(x) = \frac{\mathbf{1}\{S(x) = \tau\}\, p(x)}{P(S = \tau)} \quad\text{and}\quad p'_\tau(x) = \frac{\mathbf{1}\{S'(x) = \tau\}\, p(x)}{P(S' = \tau)} \qquad (89)$$
respectively. We may assume without loss of generality that $P(S' = \tau) \ge P(S = \tau)$, so that the total variation is
$$\int |p_\tau(x) - p'_\tau(x)| = \int_{S = \tau} \big(p_\tau(x) - p'_\tau(x)\big) + \int_{S \ne \tau,\, S' = \tau} p'_\tau(x) = 1 - \frac{P(S' = \tau, S = \tau)}{P(S' = \tau)} + \frac{P(S' = \tau, S \ne \tau)}{P(S' = \tau)} = 2\, P(S \ne \tau \mid S' = \tau) < 2e^{-cs}. \qquad (90)$$
The upshot is that, when considering the splitting probability in the next node, we can ignore the difference in the distribution of $X_1, \ldots, X_s$ when conditioning on $S = \tau$ versus conditioning on $S' = \tau$, and pay a cost of $O(e^{-cs})$.

Now consider bounding the difference of the splitting probabilities at the next split,
$$P(S_2 = s_2 \mid S = \tau) - P(S_2 = s_2 \mid S = \tau, X_1 = x). \qquad (91)$$
Again, the strategy is to find a coupling $(S_2, S'_2)$ such that
$$S_2 \sim (S_2 \mid S = \tau) \quad\text{and}\quad S'_2 \sim (S_2 \mid S = \tau, X_1 = x) \qquad (92)$$
with $\mathrm{TV}(S_2, S'_2) \le e^{-cs|v|}$, $|v|$ being the volume of the hyperrectangle $v$ corresponding to one of the half-spaces produced by $\tau$. Since the distribution of $X_1, \ldots, X_s$ conditional on $S = \tau$ differs from its unconditional distribution by an amount $e^{-cs}$ in total variation distance, we may use the following coupling:
$$S_2 = \arg\max_i f_i\!\left( \frac{1}{n_v}\sum_{X_\ell \in v} m_1(X_\ell), \ldots, \frac{1}{n_v}\sum_{X_\ell \in v} m_Q(X_\ell) \right), \qquad S'_2 = \arg\max_i f_i\!\left( \frac{1}{n_v}\sum_{X'_\ell \in v} m_1(X'_\ell), \ldots, \frac{1}{n_v}\sum_{X'_\ell \in v} m_Q(X'_\ell) \right), \qquad (93)$$
where the $X_\ell$ follow the distribution of $(X_1, \ldots, X_s)$ conditional on $S = \tau$ and the $X'_\ell$ follow the distribution conditional on $S = \tau$ and $X_1 = x$. By conditioning on $S = \tau$ instead of $S' = \tau$, we increase the total variation by an amount $e^{-cs}$ via the triangle inequality.

Now the rest of the proof is the same as in the base case, noting that with high probability the number $n_v$ of points in $v$ is equal to $s|v|$ up to a multiplicative factor $(1-\eta)$, with probability at least $1 - e^{-cs\eta^2}$. The previous bounds are applied recursively at each depth $\ell$ of the DAG. At depth $\ell$, we incur an "approximation cost" from the total variation distance bounded by $e^{-c|v|s}$. Since $s|v| \ge \alpha^{\ell} s$, it follows that $\ell \le O(\log(s|v|))$, whence the cumulative cost at depth $\ell$ is $O(\log(s|v|) \cdot e^{-c|v|s})$. Putting everything together, we have proven that
$$\mathrm{TV}(p, p') \le O\big(\log(|v|s) \cdot e^{-c|v|s}\big) \le o\!\left( \left(\frac{1}{s|v|}\right)^{\epsilon} \right) \quad \text{for some } \epsilon > 0. \qquad (94)$$
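To build some intuition for the base-case coupling (85)–(88), the toy experiment below (an illustrative sketch, not the paper's construction) estimates how often a CART-style best split changes when the first sample point is pinned to a fixed value. The variance-reduction criterion, the step-function signal, and all constants are stand-in assumptions for the generic split functions $f_i$; the estimated disagreement probability should fall rapidly as $s$ grows, in line with the $O(e^{-cs})$ bound.

```python
# Toy estimate of P(S != S'): how often the best split changes when X_1 is pinned to x.
import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(0.1, 0.9, 9)          # the P candidate splits

def best_split(X, Y):
    """Index of the threshold maximizing the variance reduction (a stand-in criterion)."""
    scores = []
    for t in thresholds:
        left, right = Y[X <= t], Y[X > t]
        if len(left) == 0 or len(right) == 0:
            scores.append(-np.inf)
            continue
        scores.append(Y.var() - (len(left) * left.var() + len(right) * right.var()) / len(Y))
    return int(np.argmax(scores))

def disagreement_probability(s, reps=2000, x_pinned=0.9):
    disagree = 0
    for _ in range(reps):
        X = rng.uniform(size=s)
        Y = (X > 0.5).astype(float) + rng.normal(scale=0.2, size=s)   # sharp signal at 0.5
        S = best_split(X, Y)                        # split on the original sample
        Xc, Yc = X.copy(), Y.copy()
        Xc[0] = x_pinned                            # condition on X_1 = x
        Yc[0] = float(x_pinned > 0.5) + rng.normal(scale=0.2)
        disagree += (S != best_split(Xc, Yc))
    return disagree / reps

for s in [25, 50, 100, 200, 400]:
    print(f"s = {s:4d}:  estimated P(S != S') = {disagreement_probability(s):.4f}")
```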
Proof of Proposition 5. The claim is trivially true (by choosing an appropriate constant) if $v$ is a terminal node. Thus, fix a non-terminal node $v$ with $x_1 \notin v$ and let
$$\mathcal{X}_v = \{ X_i : X_i \in v \} \qquad (95)$$
denote the set of sample points landing in $v$, so that $k := |\mathcal{X}_v| \in \{1, \ldots, n-1\}$. Recall that $f = f(v)$ and $f' = f'(v)$ are the respective expectations of the tree estimator at $x$ when the sequence of splits is such that $v$ is the current cell of the feature space containing $x$, with $f'$ being calculated conditional on $X_1 = x_1$. It follows that $f$ and $f'$ are functions of the distribution of the "input vector" $\mathcal{X}_v$. In a slight abuse of notation, let $\Pi$ and $\Pi'$ denote the sequences of splits distributed according to the probabilities $p$ and $p'$. We will show that, for each $k \in \{1, \ldots, n-1\}$, the total variation distance between
$$(\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi = v) \quad\text{and}\quad (\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi' = v,\ X_1 = x_1) \qquad (96)$$
is bounded by $(\log s) \cdot \Delta(s|v|)$. This will suffice to bound $|f - f'|$ by the variational definition of the total variation distance,
$$\mathrm{TV}(p, p') = \sup_{|g| \le 1} \big| E_{A \sim p}\, g(A) - E_{A \sim p'}\, g(A) \big|. \qquad (97)$$
Since $x_1 \notin v$, $X_1$ is not an element of $\mathcal{X}_v$, so that $X_1$ and $\mathcal{X}_v$ are independent. Since the split sequence $\Pi'$ is distributed according to the splitting probabilities when $X_1 = x_1$, we have
$$\mathrm{dist}(\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi' = v,\ X_1 = x_1) = \mathrm{dist}(\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi' = v). \qquad (98)$$
The depth of $v$ is at most $\log s$, so that $P(\Pi = \Pi') \ge 1 - (\log s)\Delta(s|v|)$ by applying the splitting stability assumption $\log s$ many times with a union bound. By (98), the total variation distance between the distributions in (96) is bounded by $P(\Pi \ne \Pi')$, and the result follows.

Proof of Proposition 6. The idea is to recursively expand the formulas $E\,T$ and $E(T \mid X_1)$ in terms of the directed acyclic graph. We start with
$$\big| E\,T - E(T \mid X_1 = x_1) \big| = \left| \sum_v p(e) f(e) - \sum_v p'(e) f'(e) \right| \le \sum_v |p(e) - p'(e)|\, f'(e) + \left| \sum_v p'(e)\big(f(e) - f'(e)\big) \right|, \qquad (99)$$
where the sum runs over the nodes $v$ reached after the first split, $e$ denoting the edge $[0,1]^p \to v$. The second summand may be split into two parts, one over the nodes with $x_1 \in v$ and the other over the nodes with $x_1 \notin v$. If splitting stability holds with function $\Delta$, then Proposition 5 allows us to bound the second part, so that
$$\big| E\,T - E(T \mid X_1 = x_1) \big| \le \sum_v |p(e) - p'(e)|\, f'(e) + \left| \sum_v p'(e)\big(f(e) - f'(e)\big) \right| \le \Delta(\alpha s) + (\log s)\Delta(\alpha s) + \sum_{v :\, x_1 \in v} p'(e)\, |f(e) - f'(e)|, \qquad (100)$$
where we used the fact that $|v| \ge \alpha$. Now, each of the terms $|f(e) - f'(e)|$ may in turn be bounded by $\Delta(\alpha^2 s) + (\log s)\Delta(\alpha^2 s) + \sum_{w :\, x_1 \in w} (\cdots)$. Continuing in this way, we obtain
$$\big| E\,T - E(T \mid X_1 = x_1) \big| \le \log(s)\big( \Delta(s) + p_1 \Delta(\alpha s) + p_2 \Delta(\alpha^2 s) + \cdots \big) = \log s \sum_{\ell=0}^{\infty} p_\ell\, \Delta(\alpha^{\ell} s), \qquad (101)$$
where $p_\ell$ is the probability that $x$ and $x_1$ belong to the same node after $\ell$ splits. In other words, if we let $L$ be the number of splits after which $x$ and $x_1$ are separated, then
$$p_\ell = P(L \ge \ell). \qquad (102)$$
Since $x_1 \ne x$, we may assume without loss of generality that $\|x - x_1\|_\infty > \beta$ for some fixed $\beta$ (c.f. the proof of Proposition 3). In particular,
$$p_\ell \le \big(1 - \delta + o(1)\big)^{\ell} \qquad (103)$$
for sufficiently large $\ell$. Moreover, $\Delta(\alpha^{\ell} s) = s^{-\epsilon}(\alpha^{-\epsilon})^{\ell}$, whence $\delta > 1 - \alpha^{\epsilon}$ is enough to ensure that the infinite series is of order $s^{-\epsilon}$. In particular, since $\alpha^{\epsilon} \ge \alpha$ for $\epsilon \in (0, 1]$, the restriction is satisfied (after suitable constants) by $\delta > 1 - \alpha$. This completes the proof.
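To spell out the final step: ignoring the $o(1)$ correction in (103) and the finitely many initial terms, the series in (101) is dominated by a geometric series, which converges precisely under the stated restriction on $\delta$:
$$\log s \sum_{\ell=0}^{\infty} p_\ell\, \Delta(\alpha^{\ell} s) \;\lesssim\; (\log s)\, s^{-\epsilon} \sum_{\ell=0}^{\infty} \big[(1-\delta)\alpha^{-\epsilon}\big]^{\ell} = \frac{(\log s)\, s^{-\epsilon}}{1 - (1-\delta)\alpha^{-\epsilon}}, \qquad (1-\delta)\alpha^{-\epsilon} < 1 \iff \delta > 1 - \alpha^{\epsilon}.$$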
References

[1] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3149–3157, Red Hook, NY, USA, 2017. Curran Associates Inc.
[3] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.
[4] Susan Athey, Julie Tibshirani, and Stefan Wager. The grf algorithm. https://github.com/grf-labs/grf/, 2017.
[5] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 6639–6649, Red Hook, NY, USA, 2018. Curran Associates Inc.
[6] Baptiste Gregorutti, Bertrand Michel, and Philippe Saint-Pierre. Correlation and variable importance in random forests. Statistics and Computing, 27(3):659–678, 2017.
[7] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008.
[8] Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236, 2010.
[9] Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
[10] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.
[11] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
[12] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
[13] Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. Annals of Statistics, 47(2):1148–1178, 2019.
[14] Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv: Statistics Theory, 2015.
[15] Nino Arsov, Martin Pavlovski, and Ljupco Kocarev. Stability of decision trees and logistic regression, 2019.
[16] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 45(4):2309–2352, 2017.
[17] Xiaohui Chen. Gaussian and bootstrap approximations for high-dimensional U-statistics and their applications. Annals of Statistics, 46(2):642–678, 2018.
[18] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.
[19] Google and other contributors. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[21] Rogier Quaedvlieg. Multi-horizon forecast comparison. Journal of Business & Economic Statistics, pages 1–14, 2019.
[22] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
[23] Patrick Billingsley.