Asymptotic Normality for Multivariate Random Forest Estimators
Kevin Li∗
Massachusetts Institute of Technology
[email protected]
December 16, 2020
Abstract
Regression trees and random forests are non-parametric estimators that are widely used in data analysis. A recent paper by Athey and Wager shows that the pointwise random forest estimate is asymptotically Gaussian. In this paper, we extend their result to the multivariate case and show that a vector of estimates, taken at multiple points, is jointly asymptotically normal. Specifically, the covariance matrix of the limiting normal distribution is diagonal, so that the estimates at any two points are independent in sufficiently deep trees. We show that the off-diagonal terms are bounded by quantities related to the probability of two given points belonging to the same leaf of the resulting tree. Our results rely on certain stability properties of the underlying tree estimator, and we give examples of splitting rules for which this holds. We also provide a heuristic and numerical simulations for gauging the decay of the off-diagonal term in finite samples.
Trees and random forests are non-parametric estimators first introduced by Breiman [1]. Given a feature space X ⊂ R^p and a set of data points {(X_i, Y_i)} ⊆ X × R, a tree estimator recursively partitions the feature space into axis-aligned, non-overlapping hyperrectangles by repeatedly splitting X along a given axis.¹ The prediction of the tree estimator at a test point x ∈ X is then an aggregate of the targets Y_i that land in the hyperrectangle containing x; when Y_i is continuous, the aggregate is the sample mean and the tree is also known as a regression tree. The depth of a tree estimator—defined as the maximal number of splits taken before reaching a terminal hyperrectangle—controls the complexity of the tree estimator. There are two popular methods for controlling complexity: the "boosting" approach grows trees of large depth, then reduces complexity by either trimming the tree (i.e., so that predictions are made at a non-terminal hyperrectangle) or introducing a decay factor; the "bagging" approach instead grows a collection of shallow trees on different subsets of the data, and averages over the trees for the final prediction. The intuition for bagging is that trees grown on different subsets are not perfectly correlated, so that aggregation reduces variance and balances the bias-variance tradeoff. Estimators of this type are called random forests, and they are the focus of this paper.

∗ I would like to thank my advisors Alberto Abadie and Victor Chernozhukov for reviewing multiple drafts of this paper. In addition, Sophie (Liyang) Sun, Ben Deaner, participants in the seminar class 14.386 (Spring 2020), and participants of MIT's Econometrics Lunch seminar provided very useful feedback. I would also like to thank Professor Stefan Wager for helping me understand the mechanics of covariance estimates in random forest models.

¹ When the feature space X is not rectangular, one may always enlarge X to a rectangular set X′ defined as the intersection of all rectangular sets containing X.

Since their introduction in the early 2000s, random forests have become an increasingly important tool in applied data analysis, owing to a number of practical advantages over competing models. First, high-quality random forest libraries are readily available, with popular implementations that scale to hundreds of distributed workers [2, 3]. Moreover, the core algorithms behind tree estimators and random forests are simple enough to allow for rapid prototyping of bespoke implementations, e.g. [4]. Another advantage of tree-based methods is that they can ingest real-world data without much issue: continuous, discrete, and ordered categorical features may be freely mixed [5],² model estimates are immune to feature outliers, and missing data may be easily incorporated. Additionally, their construction naturally aligns with the spatial locality of most applied problems: that is, the underlying target function relating Y to X is continuous. Finally, tree models are interpretable, with well-defined notions of feature importance [6, 7], which supports their use as model selection tools [8].

Within economics, random forests may be fruitfully applied to estimate heterogeneous treatment effects. In Rubin's potential outcomes framework [9] (see [10] for an overview), an individual i is associated with two potential outcomes Y^(0) and Y^(1), with one of the outcomes being realized depending on whether i undergoes treatment.
The statistician has access to the IID observations {(X_i, W_i, Y_i) : 1 ≤ i ≤ n}, where X_i is a vector of observed covariates for individual i, W_i ∈ {0, 1} is (an encoding of) their treatment status, and Y_i = Y_i^(W_i) is her realized outcome. One quantity of interest is the treatment effect at x,

    τ(x) := E(Y^(1) − Y^(0) | X_i = x).    (1)

Since only one of Y_i^(0) and Y_i^(1) is observed, consistent estimation of τ(x) requires further distributional assumptions. A common assumption is unconfoundedness, i.e., that treatment status W_i is independent of Y^(1) and Y^(0) conditional on X_i. Under this assumption,

    τ(x) = E[ Y_i ( W_i / e(x) − (1 − W_i) / (1 − e(x)) ) | X_i = x ],  where e(x) = P(W_i = 1 | X_i = x).    (2)

Here, the key function e(x), known as the propensity score, is the probability of treatment for the subpopulation with covariates x; see [11] for the derivation and implications. Machine learning methods—including random forests—may be brought to bear on the problem by estimating e(x). Alternatively, unconfoundedness also implies

    τ(x) = E(Y | W = 1, X = x) − E(Y | W = 0, X = x),    (3)

so that τ(x) may be estimated by fitting two models, one on the subset of the sample in which W = 1, and the other on W = 0.

² Splits on discrete features partition that variable into two arbitrary non-empty sets; no changes are needed for ordered categorical features.
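As a rough illustration of the two-model strategy in (3), the sketch below fits one regression forest per treatment arm and differences their predictions; the difference of treatment effects at two points is then a difference of such estimates. The simulated data, the scikit-learn RandomForestRegressor, and all tuning parameters here are illustrative assumptions rather than the exact procedure analyzed in this paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated data (illustrative): X uniform on [0,1]^2, random treatment assignment,
# and a true treatment effect tau(x) = x_1 that we would like to recover.
n = 5000
X = rng.uniform(size=(n, 2))
W = rng.integers(0, 2, size=n)
Y = X[:, 0] + X[:, 1] + W * X[:, 0] + rng.normal(scale=0.1, size=n)

# One forest per treatment arm, as in equation (3).
rf1 = RandomForestRegressor(n_estimators=200, min_samples_leaf=5).fit(X[W == 1], Y[W == 1])
rf0 = RandomForestRegressor(n_estimators=200, min_samples_leaf=5).fit(X[W == 0], Y[W == 0])

def tau_hat(x):
    """Estimated treatment effect at a single test point x."""
    x = np.asarray(x, float).reshape(1, -1)
    return rf1.predict(x)[0] - rf0.predict(x)[0]

# Difference of treatment effects at two subpopulations; the variance of this
# difference involves the covariance of the forest estimates at x and x_bar.
x, x_bar = [0.2, 0.5], [0.8, 0.5]
print(tau_hat(x) - tau_hat(x_bar))
```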
In econometric applications, conducting pointwise inference on the target function f : X → R (e.g., to test the null hypothesis H_0 : f(x) = 0) requires knowledge about the rate of convergence or asymptotic distribution of the underlying estimator f̂(x), where x is the point of interest. However, functionals of the target function are often also of interest: for example, the difference of treatment effects (i.e., f = τ) for two different subpopulations is captured by the quantity

    f(x) − f(x̄),    (4)

where x and x̄ are covariates describing the two subpopulations. More generally, we might also be interested in a weighted treatment effect, where a subpopulation x is given an importance weight modeled as a density µ(x). In this case, the corresponding functional of f is

    ∫_{x∈X} f(x) dµ,  where µ is not necessarily the density of X,    (5)

and the integral is taken over the domain X.

Inference on functionals of f requires not only the asymptotic distribution of the point estimate f(x), but also the correlation between estimates at different points f(x) and f(x̄). As a concrete example, consider the function τ(x) and the simple difference τ(x) − τ(x̄). We have

    τ(x) − τ(x̄) = [E(Y | W = 1, X = x) − E(Y | W = 1, X = x̄)] − [E(Y | W = 0, X = x) − E(Y | W = 0, X = x̄)] =: A − B.    (6)

We may estimate the difference by estimating A and B separately, fitting a random forest model to the two "halves" of the dataset where W_i = 1 and W_i = 0, as discussed above. The estimators Â and B̂ obtained are thus independent, so that Var(Â − B̂) = Var Â + Var B̂. The variances of Â and B̂ in turn depend on the covariance of their respective random forest estimates at x and x̄.

This paper studies the correlation structure of a class of random forest models whose asymptotic distributions were first worked out in [12]. We find sufficient conditions under which the asymptotic covariance of random forest estimates at different points vanishes relative to their respective variances; moreover, we provide finite sample heuristics based on our calculations. To the best of our knowledge, this is the first set of results on the correlation structure of random forest estimators.

The present paper builds on and extends the results in [12], which in turn builds on related work [14] on general concentration properties of trees and random forest estimators. See also [13], which extends the random forest model considered here to a broader class of target functions by incorporating knowledge of moment conditions. Stability results established in this paper have appeared in [15], who study notions of algorithmic stability for random forests and logistic regression and derive generalization error guarantees. Also closely related to our paper are [16] and [17], concerning finite sample Gaussian approximations of sums and U-statistics in high dimensions. In this context, our paper provides a stepping stone towards applying the theory of finite sample U-statistics to random forests, where bounds on covariance matrices play a central role.

The paper is structured as follows. In Section 2, we introduce the random forest model and state the assumptions required for our results; Section 3 contains our main theoretical contributions; Section 4 builds on Section 3 and discusses heuristics useful in finite sample settings; Section 5 concludes. All proofs are found in the appendix.
The goal of this paper is to study asymptotic Gaussian approximations of random forest estimators. Throughout, we assume that a random sample {Z_i = (X_i, Y_i) : 1 ≤ i ≤ n} ⊆ X × R is given, where each X_i is a vector of features or covariates belonging to a subset X ⊆ R^p of p-dimensional Euclidean space, and Y_i ∈ R is the response or target corresponding to X_i. We will refer to X as the feature space or the feature domain.

Given the data set {Z_i}_{i=1}^n, a tree estimator recursively partitions the feature space by making axis-aligned splits. Specifically, an axis-aligned split is a pair (j, t) where j ∈ {1, . . . , p} is the splitting coordinate and t ∈ R is the splitting index; given a subset R ⊆ X, a split (j, t) divides R into left and right halves

    {x ∈ R : x_j < t}  and  {x ∈ R : x_j > t},    (7)

where x_j denotes the j-th coordinate of the vector x.³ Starting with the entire feature space X, the recursive splitting algorithm computes an (axis-aligned) split based on the data {Z_i : 1 ≤ i ≤ n}; for example, when the target Y_i is continuous, a popular choice is

    (j, t) = arg min_{j̃, t̃} Σ_{i : X_i ∈ L} (Y_i − µ_L)² + Σ_{i : X_i ∈ R} (Y_i − µ_R)²,    (8)

where L = L(j̃, t̃) and R = R(j̃, t̃) are the two halves of X obtained by the split (j̃, t̃), with µ_L and µ_R being the averages of the targets Y_i whose corresponding features land in L and R, respectively.

After the first split, X is divided into two halves L and R. The process is then repeated for L and R separately, in that a split for L is computed using the subset of the data whose features X_i belong to L, and likewise for R. Each of the halves is then split again, and so on, until a stopping criterion is met. The process completes when the stopping criterion is satisfied for each node; at this point, the collection of halves forms a partition of X, with each element of the partition—and all the halves that came before it—being the intersection of a hyperrectangle with X. The sequence of splits corresponds to a tree in the natural way; we will call the halfspaces that arise during the splitting nodes, and elements of the final partition terminal nodes.

³ According to (7), we exclude edge cases where points lie on an "edge" of the rectangle. This is not an issue for continuous variables, while for categorical variables, the definition should be slightly changed so that one of the halves contains x_j = t. In this paper, we deal only with continuous features.

Given the collection N_1, . . . , N_q of terminal nodes (which form a partition of X), the prediction of the tree at a generic test point x ∈ X is the average of the responses that belong in the same terminal node as x:

    T(x; ξ, Z_1, . . . , Z_n) = Σ_{j=1}^q 1(x ∈ N_j) (1/|N_j|) Σ_{i : X_i ∈ N_j} Y_i,    (9)

where the inner sum runs over observations i for which X_i belongs to the terminal node N_j, and |N_j| is the number of such observations. The input ξ is an external source of randomization to allow for randomized split selection procedures. Thus, T(x; ξ, Z_1, . . . , Z_n) refers to the prediction at x for a tree grown using data {Z_1, . . . , Z_n} with randomization parameter ξ. As a function of x, keeping Z_1, . . . , Z_n and ξ fixed, T : X → R is then a step function, i.e., a linear combination of indicator functions of rectangular sets.
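For concreteness, the following sketch carries out the exhaustive search in (8) over axis-aligned splits of a single node and forms the child-mean prediction in the spirit of (9) after one split. It is a minimal, unoptimized illustration; the thresholds, tie-breaking, and stopping rules of an actual implementation are not modeled.

```python
import numpy as np

def best_split(X, Y):
    """Exhaustive search for the SSE-minimizing axis-aligned split (j, t) of one
    node, as in equation (8).  X has shape (m, p) and Y has shape (m,)."""
    m, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:   # midpoints as candidate thresholds
            left, right = Y[X[:, j] < t], Y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best[0], best[1]

def child_mean_prediction(X, Y, x, j, t):
    """Prediction in the spirit of (9) after a single split: the mean response
    over the training points falling in the child that contains x."""
    mask = X[:, j] < t if x[j] < t else X[:, j] > t
    return Y[mask].mean()

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = X[:, 0] + rng.normal(scale=0.1, size=200)
j, t = best_split(X, Y)
print(j, t, child_mean_prediction(X, Y, np.array([0.3, 0.5]), j, t))
```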
We note here that equations (8) and (9) are not the only possible choices; in particular, the rule used to choose the optimal split may be path dependent (i.e., dependent on previous splits) as in popular implementations (see [3], which allows for random forests but uses gradient boosting after the initial split), and the final prediction rule (9) may instead do a final linear fit or use a weighted average (c.f., [2, 3]). Analysis of tree (and random forest) models is complicated in these cases, so the present paper stipulates that the algorithm uses a splitting rule that is "similar" to (8) (c.f. Proposition 4) and uses (9) as the final prediction. In particular, this implies that the target estimated by the tree estimator is the regression function x ↦ E(Y | X = x).

Given a specific tree estimator T, i.e., given a set of splitting rules which defines the tree estimator, we define the random forest estimator to be the average of the tree estimator across all C(n, s) subsamples, marginalizing over the randomization device ξ. Specifically, the random forest estimate RF(x) at x ∈ X, given data {Z_1, . . . , Z_n}, is defined to be

    RF(x; Z_1, . . . , Z_n) = (1/C(n, s)) Σ_{i_1, . . . , i_s} E_ξ T(x; ξ, Z_{i_1}, . . . , Z_{i_s}),    (10)

where the summation runs over size-s subsets of {1, . . . , n}, and the inner expectation is taken with respect to ξ. Importantly, note that each tree is grown on a subsample of size s < n. We follow [12] in assuming that s ∼ n^β for some β sufficiently close to one; specifically, we assume throughout that the subsample size is chosen so as to satisfy the assumptions of Theorem 3 of [12], so that—along with other assumptions to be introduced presently—the random forest estimator RF is a consistent estimator of the target function x ↦ E(Y | X = x) (see discussion above).

In keeping with the notation of [12], we will write T(x; Z_1, . . . , Z_s) to refer to the expectation of T(x; ξ, Z_1, . . . , Z_s) over ξ. With this notation, the random forest estimator (10) at x is a U-statistic with the size-s kernel (Z_1, . . . , Z_s) ↦ T(x; Z_1, . . . , Z_s). We will discuss the U-statistic representation of RF in greater detail in Section 3.
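The average over all C(n, s) subsamples in (10) is intractable to compute exactly, so in practice it is approximated by Monte Carlo over random subsamples. The sketch below does this with scikit-learn's DecisionTreeRegressor standing in for the tree estimator T; the library's splitting rule differs from the one assumed here, so the code should be read as an approximation of the estimator's structure, not of its exact distribution theory.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest_predict(X, Y, x_test, s, B=500, seed=0):
    """Monte Carlo approximation of the subsampled forest (10): average the
    predictions of B trees, each grown on a random size-s subsample drawn
    without replacement."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    preds = np.zeros((B, len(x_test)))
    for b in range(B):
        idx = rng.choice(n, size=s, replace=False)        # one size-s subsample
        tree = DecisionTreeRegressor(min_samples_leaf=2)  # small leaves, cf. (alpha, k)-regularity
        preds[b] = tree.fit(X[idx], Y[idx]).predict(x_test)
    return preds.mean(axis=0)

# Example with s ~ n^beta for beta close to one, as assumed in the text.
rng = np.random.default_rng(1)
n, beta = 2000, 0.8
X = rng.uniform(size=(n, 2))
Y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n)
print(random_forest_predict(X, Y, np.array([[0.3, 0.7]]), s=int(n ** beta)))
```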
2.3 Discussion of Model Assumptions

As our results will be an extension of the results in [12], we will study the same model of random forests and adopt a similar set of assumptions. The assumptions regarding tree estimators have appeared before in [14], while the distributional assumptions on the conditional moments of Y are standard (see e.g., Chapters 7 and 9 in [18]).

The first—and most bespoke—assumption is that the tree algorithm is honest. Intuitively, honesty stipulates that knowledge of the tree structure does not affect the conditional distribution of tree estimates when the features are fixed.

Assumption 1 (Honesty). The target Y_i and the tree structure (i.e., the splitting coordinates and splitting indices) are independent conditional on X_i. Specifically, we require

    dist(Y_i | X_i, S) = dist(Y_i | X_i),    (11)

for all observations i where Y_i participates in the final prediction, where S is the set of splits chosen by the tree algorithm. (The second equality is automatic as a consequence of independent observations.)

There are several ways to satisfy this assumption. The first is to calculate splits based on the features X_i only. This rules out the example splitting rule given in (8), so we could instead use its analog in feature space,

    (j, t) = arg min_{j̃, t̃} Σ_{i : X_i ∈ L} ‖X_i − µ_L‖² + Σ_{i : X_i ∈ R} ‖X_i − µ_R‖²,    (12)

where here µ_L and µ_R denote the average (i.e., center of mass) of the X_i in each halfspace. In this instance, the choice of splits is essentially a clustering algorithm that finds the best division of the sample points into two parts. Another way to satisfy the honesty assumption while still computing splits based on the targets is to use sample splitting. The data is partitioned into two parts I_1 and I_2; observations in I_1 and the features X_i for i ∈ I_2 may be freely used during the splitting process, while the Y_i for i ∈ I_2 are used to determine terminal node values. In this case, equality in (11) is required to hold for i ∈ I_2. Finally, a third method to satisfy honesty requires the existence of auxiliary data {W_i}. During the splitting stage ("model fitting"), splits are computed as if the response variable is W_i; for example, (8) is used with µ_L and µ_R being the averages of the {W_i}. Once the tree is fully grown, predictions ("model inference"⁴) are made using the Y_i's as usual. The practice of using such surrogate targets is especially popular in time-series prediction, where different horizons are used in the fitting and inference steps (c.f., [21]).

In the present paper, for simplicity of notation, we shall assume that the first scheme is used to satisfy honesty—namely, splitting decisions are based on the feature vectors X_i only. Our results continue to hold under the other schemes as well.⁵

⁴ The terminology 'inference' is used to mean computing the predictions of an existing model, which is unrelated to the typical usage of 'inference' in econometrics. The former terminology is standard in applied settings, used when describing a data pipeline: see the documentation of [19, 20].

⁵ As in [12], constants appearing in our bounds may change in scheme two.
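A minimal sketch of the sample-splitting scheme for honesty: the tree structure is learned on one half of the data (I1) and the terminal-node averages are recomputed on the other half (I2). It assumes scikit-learn's DecisionTreeRegressor and its apply() method (which returns the leaf index of each observation); the fallback for empty leaves is an implementation convenience, not part of the formal scheme.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def honest_tree_predict(X, Y, x_test, seed=0):
    """Honest prediction via sample splitting: the splits are chosen using only
    the I1 half of the data, and the leaf values are recomputed as averages of
    Y over the I2 half within each leaf."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    perm = rng.permutation(n)
    I1, I2 = perm[: n // 2], perm[n // 2:]

    tree = DecisionTreeRegressor(min_samples_leaf=5)
    tree.fit(X[I1], Y[I1])                     # structure determined by I1 only

    leaves_I2 = tree.apply(X[I2])              # leaf index of each I2 observation
    leaves_test = tree.apply(np.asarray(x_test, float))
    preds = np.empty(len(leaves_test))
    for i, leaf in enumerate(leaves_test):
        mask = leaves_I2 == leaf
        # Fall back to the overall I2 mean if no I2 point lands in this leaf.
        preds[i] = Y[I2][mask].mean() if mask.any() else Y[I2].mean()
    return preds
```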
Our next assumption ensures that each one of the p axes is chosen as the splitting coordinate with a probability bounded from below.

Assumption 2 (Randomized Cyclic Splits). When computing the optimal split, the algorithm flips a probability-δ coin that is entirely independent of everything else. The first time the coin lands heads, the first coordinate is chosen as the splitting coordinate; the second time, the second coordinate is chosen, and so on, such that on the J-th time the coin lands heads, the ((J − 1) mod p) + 1-th coordinate is chosen.⁶ After the random splitting coordinate is chosen, the splitting index may still be chosen based on the observations.

⁶ Coordinates are indexed 1 through p, hence the addition of 1 to (J − 1) mod p.

This is a modification of the random splitting assumption in [12], in which each of the p axes has a probability δ of being chosen at each split. This could be directly implemented by flipping a probability-pδ coin and selecting one of the p coordinates uniformly at random to split on when the coin lands heads. Another method, studied in [13] and implemented by popular libraries such as [3], randomizes the number of available splitting axes in each round. Specifically, a Poisson random variable Q with intensity proportional to √p is first realized (Q is realized independently from round to round).⁷ Afterwards, min(Q, p) many axes are uniformly selected as potential candidates for splitting in that round. Clearly, this also leads to a lower bound on the probability of being chosen for each coordinate 1 ≤ j ≤ p.

⁷ Splitting for that node ceases if Q = 0.

Each of the two methods above involves two separate rounds of randomization: first, a random variable encoding the decision to split randomly is realized (i.e., the coin or the random variable Q), and second, the splitting axis is then determined. Intuitively, our cyclic splitting assumption above forgoes the second randomization step: in doing so, the variance of the number of times that any coordinate is chosen is reduced. Importantly, the variance will depend only on δ and not on p, which we will exploit in our proofs.
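The cyclic randomization in Assumption 2 amounts to a few lines of bookkeeping. The sketch below flips the probability-δ coin once per split and, on heads, forces the next coordinate in the cycle; on tails, the coordinate is left to the data-driven rule (returned as None here). The function name and the 0-indexed coordinates are illustrative choices, not part of the formal assumption.

```python
import numpy as np

def cyclic_split_coordinate(rng, heads_so_far, p, delta):
    """One round of Assumption 2: flip a probability-delta coin.  On heads, force
    the next coordinate in the cycle 0, 1, ..., p-1, 0, ...; on tails, return None
    and let the data-driven rule choose the coordinate.  Returns the coordinate
    (or None) together with the updated heads count."""
    if rng.random() < delta:
        return heads_so_far % p, heads_so_far + 1
    return None, heads_so_far

# Example: the coordinates forced over ten rounds with p = 3 and delta = 0.6.
rng = np.random.default_rng(0)
heads = 0
for _ in range(10):
    j, heads = cyclic_split_coordinate(rng, heads, p=3, delta=0.6)
    print(j)
```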
Assumption 3 (The Splitting Algorithm is (α, k)-Regular). There exists some α ∈ (0, 1/2) such that whenever a split occurs in a node with m sample points, each of the two resulting hyperrectangles contains at least αm points. Moreover, splitting ceases at a node only when the node contains fewer than 2k − 1 points, for some k.

This key assumption is carried over from [12] and contains two requirements. The first requires that no split may produce a halfspace containing too few observations, i.e., that both halfspaces are large when measured by the count of observations. As shown in [14], this implies that with exponentially small complementary probability, the splitting axis shrinks by a factor between α and 1 − α, so that both halfspaces are also large in Euclidean volume (with high probability).

The second half of the assumption places an upper bound on the number of observations in terminal nodes. Trees grown under this assumption will necessarily be deeper as the sample size n (and hence the subsample size s = n^β) increases. In particular, the predictions at leaf nodes—averages of observations Y_i—will be averages of a bounded number of terms. An important consequence is that the variance of the tree estimator (at any test point x) is bounded below (c.f., the distributional assumption on Var(Y | X = x)).

Assumption 4 (Predetermined Splits). The candidate splits considered at each node do not depend on the data {Z_i}, so that they are fixed ahead of time. Furthermore, the number of candidate splits at each node is finite, and every candidate split shrinks the length of its splitting axis by at most a factor α.

The predetermined splitting assumption is specific to our paper. That candidate splits considered at each node are data-independent is "almost" without loss of generality since the feature space X is fixed. For example, implementations of random forests typically use 32-bit floating point numbers as their splitting index, so that the assumption is automatically satisfied. Furthermore, we note that the assumption allows candidate splits to depend on the node, so that the splitting process is still data-driven insofar as the sequence of splits leading up to a node depends on the observations.

One interpretation of this assumption is that it aids in data compression. For example, suppose that all features are continuous and X = [0, 1]^p, and that all candidate splits have the form (j, k/m) for some integers 1 ≤ j ≤ p and 0 ≤ k < m, where m ≥ 2 is a fixed grid resolution. In this case one may store {⌊m X_ij⌋ : 1 ≤ i ≤ n, 1 ≤ j ≤ p} instead of {X_ij}. Each member of the former set is an integer in {0, . . . , m − 1} and thus can be represented using log₂ m bits. In particular, for a grid resolution of 2⁸ = 256—a fine grid even in moderately large dimensions—each coordinate of the feature vectors X_i may be stored in a single byte. Since modern CPUs and graphics processors store floating point numbers in four or eight bytes, this allows for a substantial reduction, allowing computation to scale to larger datasets. The process of encoding in this way is known as quantizing, which is an option supported by many popular software packages. In this way, though the predetermined split assumption may seem at first glance restrictive, it aligns our model more closely with practice.
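A small sketch of the quantization just described: features in [0, 1]^p are mapped to the integer grid {0, . . . , m − 1}, so that with m = 256 each coordinate occupies a single byte and the candidate splits of Assumption 4 are the fixed thresholds k/m. The grid size and data below are illustrative.

```python
import numpy as np

def quantize(X, m=256):
    """Map features in [0, 1]^p to the integer grid {0, ..., m-1}; with m = 256
    each coordinate fits in a single byte.  The grid values k/m then serve as the
    predetermined candidate split points of Assumption 4."""
    Xq = np.minimum(np.floor(X * m), m - 1)    # guard the boundary X_ij = 1
    return Xq.astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 10))
Xq = quantize(X)
print(X.nbytes, "bytes as float64 ->", Xq.nbytes, "bytes as uint8")
```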
Assumption 5 (Distributional Assumptions on the DGP of (X, Y)). The features X_i are supported on the unit cube X = [0, 1]^p with a density that is bounded away from zero and infinity. Furthermore, the functions x ↦ E(Y | X = x), x ↦ E(Y² | X = x), and x ↦ E(Y³ | X = x) are uniformly Lipschitz continuous. Finally, the conditional variance Var(Y | X = x) is bounded away from zero, i.e., inf_{x∈X} Var(Y | X = x) > 0.

The continuity and variance bound assumptions are standard. Note that a consequence of continuity and compactness of the hypercube is that the conditional moments up to order three are bounded. Our results will not explicitly depend on knowledge of the density of X: however, the density will affect the implicit constants that we carry throughout our proofs (c.f., Lemma 3.2 and Theorem 3.3 in [12]).

Gaussianity of Multivariate U-Statistics

We begin our investigation of the random forest estimator in this section. As discussed in the model introduction, the random forest estimator RF(x) at a test point x is a U-statistic whose kernel is the tree estimator T(x) marginalized over external randomizations. This paper studies the multivariate distribution of RF, specifically the correlation structure between RF(x) and RF(x̄) at distinct points x and x̄ ∈ X. Towards that end, we shall fix a collection of q test points

    x_1, . . . , x_q ∈ X    (13)

throughout the remainder of the paper. As these points will remain fixed, for notational brevity we will omit their explicit dependence when writing estimators. Therefore, RF(Z_1, . . . , Z_n) stands for the q-dimensional estimator that is the random forest evaluated at x_1, . . . , x_q, given observations {Z_i : 1 ≤ i ≤ n}. As a consequence of this notation, most of our equations are to be understood in R^q, with equality and arithmetic acting coordinate-wise. Finally, when there is no confusion, subscripts (typically k or ℓ) denote a specific coordinate, i.e., the estimate at the k-th or ℓ-th test point; a notable exception is T_1, which refers to a Hajek projection that we now describe.

We start by reviewing properties of the Hoeffding decomposition of U-statistics, also known as Hajek projections; see [22] for a textbook treatment of the univariate case. Let f(Z_1, . . . , Z_m) ∈ R^q be a generic q-dimensional statistic based on m observations. The Hajek projection of f is defined to be

    ˚f(Z_1, . . . , Z_m) = Σ_{i=1}^m E[f(Z_1, . . . , Z_m) | Z_i] − (m − 1) E f(Z_1, . . . , Z_m).    (14)

That is, it is the coordinate-wise projection of f onto the linear space spanned by functions of the form {g(Z_i) : 1 ≤ i ≤ m}. In particular, when f is symmetric in its arguments and Z_1, . . . , Z_m is an IID sequence, we have

    ˚f(Z_1, . . . , Z_m) = Σ_{i=1}^m f_1(Z_i) − (m − 1) E f,    (15)

where f_1 is the function such that f_1(Z_1) = E(f | Z_1), i.e., f_1(z) = E(f | Z_1 = z).
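The projection formulas (14) and (15) are easy to verify numerically on a toy kernel. The sketch below takes the squared sample mean of s draws from N(1, 1) as the statistic f, uses the closed form of f_1(z) = E(f | Z_1 = z), and checks by Monte Carlo that Var(f − ˚f) is small relative to Var(˚f) (of order 1/s for this non-degenerate kernel); this is exactly the kind of comparison used later in the text. The kernel and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
s, n_mc = 50, 4000

def f1(z):
    """Closed form of f1(z) = E[f | Z_1 = z] for the squared-sample-mean kernel
    f(Z_1, ..., Z_s) = (Z_bar)^2 with Z_i ~ N(1, 1): average E[((z + S)/s)^2]
    over the sum S of the remaining s - 1 draws (E S = s - 1, Var S = s - 1)."""
    ES2 = (s - 1) + (s - 1) ** 2                  # E[S^2] = Var(S) + (E S)^2
    return (z ** 2 + 2.0 * z * (s - 1) + ES2) / s ** 2

Ef = 1.0 + 1.0 / s                                # E[(Z_bar)^2] = (E Z)^2 + Var(Z_bar)

# Draw n_mc independent samples of size s; compute f and its Hajek projection (15).
Z = rng.normal(loc=1.0, size=(n_mc, s))
f_vals = Z.mean(axis=1) ** 2
proj_vals = f1(Z).sum(axis=1) - (s - 1) * Ef      # sum_i f1(Z_i) - (s - 1) E f

# The projection captures essentially all of the variance of f: the ratio is O(1/s).
print(np.var(f_vals - proj_vals) / np.var(proj_vals))
```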
In our setting, applying the Hajek projection to the centered statistic RF − µ, where µ is the expectation of RF, yields

    ˚RF(Z_1, . . . , Z_n) − µ = Σ_{i=1}^n E(RF − µ | Z_i) = (1/C(n, s)) Σ_{i=1}^n E[ Σ_{i_1, . . . , i_s} E_ξ T(ξ; Z_{i_1}, . . . , Z_{i_s}) − µ | Z_i ],    (16)

where i_1, . . . , i_s run through the C(n, s) size-s subsets of {1, . . . , n} as usual. (Recall that RF, µ, and T are all vectors in R^q.) Since the samples Z_1, . . . , Z_n are independent, E(E_ξ T(ξ; Z_{i_1}, . . . , Z_{i_s}) | Z_i) = µ whenever i ∉ {i_1, . . . , i_s}. As {i_1, . . . , i_s} runs over the size-s subsets of {1, . . . , n}, there are exactly C(n−1, s−1) many which contain i. For each of these subsets,

    E(E_ξ T(ξ; Z_{i_1}, . . . , Z_{i_s}) − µ | Z_i) =: T_1(Z_i) − µ,    (17)

where T_1(z) = E_{ξ, Z_2, . . . , Z_s} T(ξ; z, Z_2, . . . , Z_s). Therefore,

    ˚RF − µ = (1/C(n, s)) Σ_{i=1}^n C(n−1, s−1) (T_1(Z_i) − µ) = (s/n) Σ_{i=1}^n (T_1(Z_i) − µ).    (18)

The sequence of observations Z_1, . . . , Z_n is assumed to be IID, and this property is preserved for the sequence {T_1(Z_i) : 1 ≤ i ≤ n} of projections. It is easily verified that E(˚RF) = µ, and the point of the previous equation is that it expresses the centered statistic (˚RF − µ) as an average of centered IID terms, scaled by s. This will be our main entry point in establishing asymptotic joint normality.

The standard technique for deriving the asymptotic distribution of a U-statistic is to establish a lower bound on the variance of its Hajek projection; this is the approach taken by [12] and we follow it here. Let V be the variance of ˚RF; using (18), we have

    V = Var[ (s/n) Σ_{i=1}^n (T_1(Z_i) − µ) ] = (s²/n) Var(T_1(Z_1)) = (s/n) Var[ Σ_{i=1}^s T_1(Z_i) ] = (s/n) Var ˚T ∈ R^{q×q},    (19)

where ˚T is the Hajek projection of the statistic T as in (15), with T = E_ξ T(ξ; Z_1, . . . , Z_s) ∈ R^q. Since the third moment E(Y³ | X = x) is assumed to be bounded, the conditions for the Lindeberg Central Limit Theorem [23] follow easily, and applying the triangular-array CLT, we have the familiar fact

    V^{−1/2} (˚RF − µ) ⇒ N(0, I),    (20)

where 0 is the zero vector in R^q and I is the q × q identity matrix.
Remark. The condition that the third moment is bounded is not necessary. The Lindeberg conditions were directly verified in [12] (c.f., Theorem 8) and their results—without assuming a bounded third moment—apply to our setting as well. More recently, triangular array CLTs specific to U-statistics were developed in [24], and their conditions are satisfied in our case as well.

The asymptotic normality of the random forest estimator RF can be related to the asymptotic normality of ˚RF via

    V^{−1/2}(RF − µ) = V^{−1/2}(RF − ˚RF) + V^{−1/2}(˚RF − µ).    (21)

Since the second summand on the RHS is asymptotically normal, by Slutsky's Theorem, V^{−1/2}(RF − µ) is asymptotically normal once we establish the convergence V^{−1/2}(RF − ˚RF) →^P 0.
The strategy is to show that e = V^{−1/2}(RF − ˚RF) converges to zero in squared mean. We may develop its squared norm via

    E(e⊤e) = E[(RF − ˚RF)⊤ V^{−1} (RF − ˚RF)] = E tr[ V^{−1} (RF − ˚RF)(RF − ˚RF)⊤ ] = tr[ V^{−1} E(RF − ˚RF)(RF − ˚RF)⊤ ] = tr[ V^{−1/2} Var(RF − ˚RF) V^{−1/2} ],    (22)

where we used the identity tr(ABC) = tr(BCA) for conforming matrices A, B, and C.
That the trace on the extreme RHS goes to zero is the natural multivariate generalization of the familiar condition Var(f − ˚f)/Var ˚f → 0 for U-statistics [22].

In the univariate setting, the previous condition is checked by considering higher order decompositions of the statistic f; this approach is also valid in the multivariate setting, as shown below. As we will see, the more substantive difficulty is that the dimension of the kernel of the U-statistic—namely, the subsample size s of each tree—grows with the sample size.

By Proposition 1 below, we may expand RF − ˚RF according to a Hoeffding decomposition taken with respect to the matrix V^{−1}; bounding the variance of the higher-order terms of this expansion reduces the trace in (22) to a comparison of the matrices (Var ˚T)^{−1} and Var T. Specifically, [12] obtains

    (Var T)_{kk} / (Var ˚T)_{kk} ≤ c (log s)^p,  for each k = 1, . . . , q,    (28)

for some constant c. As we will see in the next section, the required bound on the trace will follow by developing bounds on the off-diagonal elements of Var ˚T, i.e., bounds on the covariance between random forest estimates at different test points (see discussion following Proposition 2).

Proposition 1 (Hoeffding Decomposition for Multivariate U-statistics). Fix a positive definite matrix M. Let f(x_1, . . . , x_n) ∈ R^q be a vector-valued function that is symmetric in its arguments and let X_1, . . . , X_n be a random sample such that f(X_1, . . . , X_n) has finite variance. Then there exist functions f_1, f_2, . . . , f_n such that

    f(X_1, . . . , X_n) = E(f) + Σ_{i=1}^n f_1(X_i) + Σ_{i<j} f_2(X_i, X_j) + · · · + f_n(X_1, . . . , X_n),

where the summands are mutually orthogonal with respect to the inner product ⟨X, Y⟩ = E(X⊤ M Y).

The aim of this section is to establish asymptotic bounds on the off-diagonal elements of the covariance matrix Var ˚T. We shall show that when the tree estimator employs splitting algorithms satisfying suitable stability conditions, we have the asymptotic behavior

    (Var ˚T)_{k,l} = o(s^{−ε})  for all 1 ≤ k ≠ l ≤ q and some ε > 0.    (31)

Before proceeding, we first show that this bound, coupled with control on the diagonal terms (28), suffices to establish that the trace bound in (27) vanishes.

Proposition 2. The entries of Var T are bounded and its diagonal entries are bounded away from zero. Furthermore, when Var ˚T satisfies the condition in (31),

    (s/n) tr( (Var ˚T)^{−1} Var T ) → 0.    (32)

Remark. The first part of the Proposition, concerning the entries of Var T, is a consequence of our (α, k)-regularity assumption and the distributional assumptions on {Z_i}. As discussed in the assumption section, since the number of observations in leaf nodes is bounded above, the (pointwise) variance of the tree estimator at x is bounded below by (a constant times) Var(Y | X = x), and we assumed that the latter function is bounded away from zero. That the off-diagonal entries are bounded is a trivial consequence of the fact that E(Y² | X = x) is Lipschitz and thus bounded. The techniques we present to bound Var ˚T could also be used to bound Var T; it is in fact true that (Var T)_{k,l} → 0 for k ≠ l, though we will not pursue this further in this paper.

Proposition 2 establishes ˚T as the central object of study. Recall that T is the tree estimator while ˚T is its Hajek projection; in other words, Var ˚T is not the covariance matrix of tree estimates. However, our result demonstrates the asymptotic normality of V^{−1/2}(RF − µ), where V, the variance of the Hajek projection ˚RF, is given in terms of Var ˚T (c.f., (19)).
Therefore, (a rescaled version of) ˚T is precisely the object needed to conduct inference on the random forest. In particular, combining (28) and (31) yields the fact that Var ˚T—and hence the asymptotic variance of RF—is diagonally dominant (i.e., tending to a diagonal matrix in the limit).

We may always relabel indices so that the tree is grown on the observations Z_1, . . . , Z_s. To establish the bound (31), start with the definition

    ˚T − µ = Σ_{i=1}^s (E(T | Z_i) − µ),  so that  Var ˚T = s Var(E(T | Z_1))  due to independence.    (33)

To develop the term on the RHS, use the orthogonality condition for conditional expectations,

    Var E(T | Z_1) = Var[E(T | Z_1) − E(T | X_1)] + Var[E(T | X_1)].    (34)

Since the tree algorithm is honest, the difference E(T | Z_1) − E(T | X_1) simplifies, so that for each 1 ≤ k ≤ q,

    E(T_k | Z_1) − E(T_k | X_1) = E(I_k | X_1)(Y_1 − E(Y_1 | I_k = 1, X_1)),    (35)

where T_k is the tree estimate at x_k, and I_k the indicator for whether X_1 and x_k belong to the same terminal node. Therefore, the off-diagonal entry at (k, l) of Var[E(T | Z_1) − E(T | X_1)] is equal to

    E[ E(I_k | X_1) E(I_l | X_1)(Y_1 − E(Y_1 | X_1, I_k = 1))(Y_1 − E(Y_1 | X_1, I_l = 1)) ].    (36)

If we expand the terms in the integrand, every term will have the shape

    E(I_k | X_1) E(I_l | X_1) · p(Y_1, E(Y_1 | X_1, I_k = 1), E(Y_1 | X_1, I_l = 1))    (37)

for some multinomial p of degree at most two. Since we have assumed that E(Y | X = x) and E(Y² | X = x) are continuous and hence bounded, E(p | X_1 = x) is also bounded. Using the Law of Iterated Expectations to evaluate (36) then shows that it is bounded by a constant times

    E[ E(I_k | X_1) E(I_l | X_1) ].    (38)

Remark. A direct application of the Cauchy-Schwarz inequality, using only the fact that E(Y² | X = x) is bounded, would yield the weaker bound

    √( E[ E(I_k | X_1)² E(I_l | X_1)² ] ) ≤ √( E[ E(I_k | X_1) E(I_l | X_1) ] )    (39)

up to a multiplicative constant.

Recall that I_k and I_l are indicator variables for whether X_1 belongs to the same hyperrectangle as x_k and x_l, respectively. Therefore, E(I_k | X_1) is the probability that the first observation is used for the prediction at x_k, and likewise for E(I_l | X_1). Intuitively, this only happens when X_1 is near x_k (respectively, x_l): since x_k ≠ x_l, X_1 cannot be near both, meaning that the product E(I_k | X_1) E(I_l | X_1) is small.

Proposition 3. For two points x and x̄ ∈ X = [0, 1]^p, define

    M(x, x̄) = E[ E(I | X_1) E(Ī | X_1) ],    (40)

where I and Ī are indicators for X_1 belonging to the same terminal node as x and x̄, respectively. If δ > 1/2 and x ≠ x̄,

    M(x, x̄) = o(s^{−(1+ε)})  for some ε > 0.    (41)

Remark. It is instructive to compare the bound in the preceding display with M(x, x). It is clear from the definition that M(x, x) ≥ M(x, x̄) for all x̄. In addition, M(x, x) = E(E(I | X_1)²) ≤ E(E(I | X_1)) = E(I). By symmetry, E I = 1/s (up to a constant), as the terminal node at x has a bounded number of observations. Therefore, all that the Proposition ensures is that when x ≠ x̄, the quantity M(x, x̄) is smaller than the "trivial" bound 1/s.

This proposition shows that the contribution of Var[E(T | Z_1) − E(T | X_1)] to the cross covariances of Var E(T | Z_1) is small, in particular smaller than the required bound (log^p s · s)^{−1}.
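The quantity M(x, x̄) can also be estimated by brute-force simulation, which is a useful diagnostic for a given splitting algorithm and data-generating process. The sketch below draws X_1, grows many small trees on subsamples containing X_1, estimates the two conditional co-occurrence probabilities, and averages their product; scikit-learn's DecisionTreeRegressor and the simple DGP are stand-ins, so the numbers it produces illustrate the decay of M(x, x̄) rather than the constants appearing in Proposition 3.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def estimate_M(x, x_bar, s=200, n_outer=100, n_inner=50, p=2, seed=0):
    """Monte Carlo estimate of M(x, x_bar) = E[ E(I | X_1) E(I_bar | X_1) ].
    Outer loop: draw X_1.  Inner loop: grow trees on subsamples containing X_1
    and estimate the conditional probabilities that X_1 shares a terminal node
    with x and with x_bar.  The DGP and tree settings are illustrative."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float).reshape(1, -1)
    x_bar = np.asarray(x_bar, float).reshape(1, -1)
    total = 0.0
    for _ in range(n_outer):
        x1 = rng.uniform(size=(1, p))                        # a draw of X_1
        q, q_bar = 0.0, 0.0
        for _ in range(n_inner):
            X = np.vstack([x1, rng.uniform(size=(s - 1, p))])
            Y = X.sum(axis=1) + rng.normal(scale=0.1, size=s)
            tree = DecisionTreeRegressor(min_samples_leaf=2).fit(X, Y)
            leaf_x1 = tree.apply(x1)[0]
            q += float(leaf_x1 == tree.apply(x)[0])
            q_bar += float(leaf_x1 == tree.apply(x_bar)[0])
        total += (q / n_inner) * (q_bar / n_inner)
    return total / n_outer

# M(x, x) is of order 1/s, while M(x, x_bar) for separated points should be far smaller.
print(estimate_M([0.3, 0.3], [0.3, 0.3]), estimate_M([0.3, 0.3], [0.7, 0.7]))
```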
The requirement that δ > 1/2, while needed for the proof to go through, is almost certainly not needed in practice. The reason is that our proof uses δ > 1/2 to obtain a uniform bound on the quantity

    E(I_k | X_1) E(I_l | X_1),    (42)

while the proposition demands a bound only on its expectation. Indeed, in the extreme case x = 0 and x̄ = (1, . . . , 1)⊤, it is easy to see that the expectation satisfies the required bound even when δ ≤ 1/2, and in general we expect that M(x, x̄) will be smaller than that predicted by (41). In light of this, an alternative to our cyclic splitting assumption is to assume the high level condition that the splitting algorithm and data generating process confer the bound

    M(x, x̄) = o( (log^p s · s)^{−1} ).    (43)

3.4.1 Bounding Var E(T | X_1)

We turn next to bounding the off-diagonal terms in Var[E(T | X_1)]. As in the statement of Proposition 3, it will be convenient to slightly change notation. We fix 1 ≤ k ≠ l ≤ q, write x for x_k and x̄ for x_l, and let x_1 denote the value of X_1. The goal of this section is to establish the bound

    Var[E(T | X_1)]_{kl} = E[ (E(T | X_1 = x_1) − µ)(E(T̄ | X_1 = x_1) − µ̄) ] = o( (log^p s · s)^{−1} ),    (44)

where T and T̄ are the tree estimates at x and x̄, and µ and µ̄ are the (unconditional) expectations of T and T̄. (Note that we have also changed the notation of T slightly; before, T was the q-dimensional vector of the estimates at all test points; for this section, it is the pointwise estimate at x = x_k.)

The quantity E(T | X_1 = x_1) − µ = E(T | X_1 = x_1) − E(T) measures the degree of "information" that the location of a single observation X_1 carries for the output of the tree at x. Intuitively, when x_1 is near x, the effect of x_1 on the leaf node containing x is more pronounced and we expect E(T | X_1 = x_1) − µ ≈ E(Y | X = x_1) − µ. Conversely, when x_1 is far from x, its effect in determining the location of the leaf node containing x diminishes, and E(T | X_1 = x_1) − µ ≈ 0. The key object in our analysis is the point at which the observation X_1 = x_1 leaves the intermediate partition containing x, where "intermediate partition" refers to the nodes created during the splitting process that are not necessarily terminal. We will see that once X_1 and x are separated, the effect of X_1 on the prediction decreases.

Towards this end, fix x_1 and let Π denote the terminal node containing x; Π is a random subset of X created by axis-aligned splits. By Assumption 4, the set of potential splits does not depend on the sample (in particular, it does not depend on X_1). Moreover, splitting ceases after no more than s splits, regardless of the subsample X_1, . . . , X_s, as each split reduces the number of observations in its two children nodes by at least one. Therefore, Π takes on only finitely many possible values, and we may write

    E(T) = Σ_π P(Π = π) µ_π  and  E(T | X_1 = x_1) = Σ_π P(Π = π | X_1 = x_1) µ′_π,    (45)

where µ_π = E(T | Π = π) and µ′_π = E(T | Π = π, X_1 = x_1).

The hyperrectangle Π is determined by the recursive splitting procedure used to grow the tree, and there is a natural correspondence between (45) and a certain "expectation" taken over the directed acyclic graph (DAG) defined in the following way. Let [0, 1]^p be the root of the DAG; for every potential split at [0, 1]^p, there is a directed edge to a new vertex, where the vertex is the hyperrectangle that contains x.
If the node represented by a vertex is one of the possible values of Π, then that vertex is a leaf in the DAG and has no outgoing edges; other vertices carry an outgoing edge for each potential split at that node, with each edge going to another vertex which is again a hyperrectangle containing x.

The previous definition determines the DAG recursively: each vertex in the DAG is a node containing x, with terminal vertices corresponding to terminal nodes. To each terminal vertex v, we associate the value f(v) := µ_v as in (45). In addition, each edge e = (v → w) corresponds to a split at a node v producing a halfspace w of v; associate with this edge a "transition probability"

    p(e) := P(the split producing w is chosen at v | current node is v) =: P(w | v).    (46)

Given the transition probabilities, the value f may be extended to each vertex v recursively via the formula

    f(v) := Σ_{e : v → w} P(w | v) f(w).    (47)

We refer to f as the continuation value at v, and by construction we have

    E(T) = f("root") = f([0, 1]^p).    (48)

Alternatively, if we had assigned the values f′(v) = µ′_v to each terminal vertex and used the transition probabilities

    p′(e) = P(the split producing w is chosen at v | current node is v, X_1 = x_1) =: P′(w | v),    (49)

then we recover E(T | X_1 = x_1) = f′([0, 1]^p) after extending f′ in the same way as f. In other words, bounding E(T | X_1 = x_1) − E(T) requires bounding the difference between the two types of continuation values.

We will need to assume that p′(e) ≈ p(e); that is, conditioning on a single observation does not greatly affect the probability that a particular split is chosen (i.e., that a particular split is optimal at its node). This is a natural assumption in that the optimal split is computed using all the observations in a particular node, so that conditioning on a single observation should have relatively little effect. Whether this holds depends on the specifics of the splitting algorithm used to construct a tree, and we call rules which satisfy the following assumption "stable splitting rules."

Assumption 6 (Splitting Stability). For any node v, the total variation distance between the distributions {p(e)}_{e : v → w} and {p′(e)}_{e : v → w} is bounded in terms of the volume of v. Specifically, there exists some ε > 0 such that for all v,

    TV(p, p′) ≤ (s |v|)^{−(1+ε)}  (up to a constant).    (50)

Here, |v| denotes the volume of the hyperrectangle at v, i.e.,

    |v| = | Π_{j=1}^p (a_j, b_j) | = Π_{j=1}^p |b_j − a_j|.    (51)

Remark. The distributions p and p′ are discrete: thus, if p and p′ are written as vectors of probability masses, then the total variation distance is the L¹ norm between the two vectors. Since the distribution of X has a density that is bounded above and below, a simple Hoeffding bound shows that the number of sample points in v is bounded above and below by a constant multiple of s|v|, with the constants adjusted so that the failure probability is less than 1/s².⁸ Since this is smaller than the required bound (log^p s · s)^{−1}, we may interpret s|v| in (50) to be the number of samples in v without loss of generality.
Relatedly, Lemma 12 of [14] (see also the proof of Lemma 2 in [12]) extends this fact to be uniform across nodes.

⁸ For example, the probability that a binomial random variable B(n, p) deviates from np by more than √(n log n) is less than C/s² for some constant C.

The stability assumption places a restriction on the procedure used to select optimal splits: namely, if the decision is made on the basis of m points, then conditioning on any one of the points changes the optimal split with probability bounded by m^{−(1+ε)}. In practice, most splitting procedures satisfy a stronger bound. A set of sufficient conditions is given in the following proposition.

Proposition 4. Assume that the optimal split at a node v is chosen based on the quantities

    f_1(µ_1, . . . , µ_Q), . . . , f_P(µ_1, . . . , µ_Q)    (52)

for some Q ≥ 1, where µ_1, . . . , µ_Q are sample averages over the points being split,

    µ_k = (1/n_v) Σ_{i : X_i ∈ v} m_k(X_i),    (53)

where the sum runs over points in v, and n_v denotes the number of these points. Specifically, suppose the optimal split is decided based on which f_i achieves the largest value, i.e., the value arg max_i f_i(µ). If f_1, . . . , f_P are Lipschitz, and the functions m_1, . . . , m_Q are such that m_k(X) is 1-subExponential, then the splitting stability assumption is satisfied.

Remark. Since the X_i are bounded, the requirement that m_k(X) is subExponential allows the use of (8) to compute the optimal split.

In general, the conditions in Proposition 4 are sufficient to guarantee an exponential bound instead of a polynomial one as in (50). Thus, Proposition 4 should be viewed simply as providing a plausibility argument that stable splitting rules are commonly encountered in practice.

The next proposition shows that bounds on splitting probabilities automatically imply a related bound on the continuation values.

Proposition 5. Suppose the splitting probabilities satisfy a generic bound ∆(·) in that

    TV(p, p′) ≤ ∆(s |v|) / log s  at each node v.    (54)

For example, ∆(z) = z^{−(1+ε)}. Then for any node v containing x but not x_1,

    |f(v) − f′(v)| ≤ C ∆(s |v|)    (55)

for some constant C not depending on v.

In what follows we take ∆(z) = z^{−(1+ε)}, where the factor ε allows us to ignore the extra logarithm. In that case, we may put the bounds on TV(p, p′) and |f − f′| together and establish the required bound on Var E(T | X_1).

Proposition 6. Suppose that the splitting rule is stable as in (50) and that δ > 1 − α. For x_1 ≠ x,

    |E(T | X_1 = x_1) − E(T)| = o(1/s^{1+ε})    (56)

for some ε > 0. In particular, the off-diagonal entries of Var E(T | X_1) are o(s^{−(1+ε)}), as at least one of x and x̄ is distinct from x_1.

Recall that Proposition 3 required that δ > 1/2. Since α < 1/2, the requirement δ > 1 − α in Proposition 6 is more restrictive. Just like Proposition 3, we argue that this requirement is plausibly looser in applications. The reason is that it is used to give the following bound on hyperrectangles v created after L splits:

    |v| ≥ α^L.    (57)

The RHS appears since a potential split may reduce the volume of a node by at most a factor α: but only an exponentially small (i.e., 2^{−L}) proportion of nodes is the result of taking the smallest possible split L times! The "average" node has volume approximately (1/2)^L, so that δ > 1/2 should suffice in practice.

Combining Propositions 3 and 6 with equations (33) and (34) yields the desired bound (31) on the off-diagonal terms of Var ˚T discussed at the beginning of this section.
Therefore, Proposition 2 applies, and the joint normality of the random forest estimator is established.

The previous sections focused on deriving the asymptotic normality result

    V^{−1/2}(RF − µ) ⇒ N(0, I),  where V = Var ˚RF = (s/n) Var ˚T.    (58)

Recall our standing convention that RF ∈ R^q is the random forest estimate at x_1, . . . , x_q and µ is its expectation. According to (9), the target function of the random forest is actually m(x) = E(Y | X = x). The results in [12] show that (RF(x) − m(x))/√V ⇒ N(0, 1) pointwise for each x ∈ X; since we have shown that V is diagonally dominant in that its off-diagonal terms vanish relative to the diagonal, the pointwise result carries over to our multivariate setting, and V^{−1/2}(RF − m) ⇒ N(0, I), where m = (m(x_1), . . . , m(x_q)).

Moreover, [12] proposes a jackknife estimator that can consistently estimate √V in the scalar case. Our diagonal dominance result implies that the random forest estimates at x and x̄ ∈ [0, 1]^p are independent in the limit n → ∞,

    Var(RF(x) + RF(x̄)) = Var(RF(x)) + Var(RF(x̄)) + 2 Cov(RF(x), RF(x̄)) ≈ Var(RF(x)) + Var(RF(x̄)),    (59)

so that the jackknife estimator for the scalar case may be fruitfully applied to obtain confidence bands for functionals of the random forest estimates (i.e., expressions involving the estimate at more than one point).

The accuracy of the approximation above depends on the decay of the off-diagonal terms in finite samples. In this section, we provide a "back of the envelope" bound for the covariance term that may be useful for practitioners. We stress that the following calculations are (mostly) heuristics: as we have shown above, the covariance term depends on quantities such as M(x, x̄), which is in turn heavily dependent on the exact mechanics of the underlying splitting algorithm. Since our aim is to produce a "usable" result, we will dispense with rigorous analysis in the remainder of this section.

To begin, the proofs of Propositions 3 and 6 showed that the asymptotic variance V has off-diagonal terms which are upper bounded by⁹

    M(x, x̄) + (log s) ( Σ_{ℓ=0}^∞ p_ℓ )( Σ_{ℓ=0}^∞ p̄_ℓ ) = M(x, x̄) + (log s) E(L) E(L̄),    (60)

where p_ℓ = P(L ≥ ℓ) is the probability that x_1 and x are not separated after ℓ splits, and likewise for p̄_ℓ = P(L̄ ≥ ℓ). That is, L is the number of splits before which x_1 and x belong to the same partition. If we denote by I (resp. Ī) the indicator variable that X_1 is in the terminal node of x (resp. x̄), then the events {I = 1} and {L = log s} are equal, so that

    E(I | X_1 = x_1) = P(L = log s) ≤ E L / log s.    (61)

Replacing the inequality with an approximation, we have E L = (log s) E(I | X_1 = x_1). All of this shows that the covariance term is bounded by

    (log s) E(I | X_1 = x_1) E(Ī | X_1 = x_1) ≈ (log s) M(x, x̄).    (62)

⁹ This is a very crude upper bound, as we have dropped the quantity ∆(α^ℓ s) from the infinite series.

Remark. Taken loosely, this heuristic says that the random forest estimator RF, considered as a function on the domain X, is asymptotically Gaussian with covariance process (log s) · M(x, x̄). We stress that this is not implied by our theoretical results, as there we kept the number q of test points fixed.

Towards a useful heuristic, we will consider a bound on the correlation instead of the covariance.
In our notation, the result of [12] lower bounds M(x, x) (and M(x̄, x̄)), while our paper provides an upper bound on M(x, x̄). Ignoring the logarithmic terms, we have

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | ≈ M(x, x̄) / √(M(x, x) M(x̄, x̄)).    (63)

Recall that M(x, x̄) = E[E(I | X_1) E(Ī | X_1)], which decays as x̄ moves away from x. Using the previous expression (note that M(x, x) ≈ M(x̄, x̄) due to symmetry between x and x̄), we can bound the correlation from purely geometric considerations. Since the integrand

    E(I | X_1) E(Ī | X_1)    (64)

decays as X_1 moves away from x (and x̄), we may imagine that its integral

    M(x, x) = ∫ E(I | X_1 = x_1)² dx_1    (65)

receives its largest contributions from points x_1 near x, say those points in an L∞-box of side length d with volume d^p, i.e., {y ∈ [0, 1]^p : ‖x − y‖∞ ≤ d/2}. If we accept this, then the contributions to the integral M(x, x̄) would come from points that are within d/2 of both x and x̄, and to a first degree approximation, the volume of the set of such points {y ∈ [0, 1]^p : ‖x − y‖∞ ≤ d/2, ‖x̄ − y‖∞ ≤ d/2} is

    (d − z_1) · · · (d − z_p) ≈ d^p − (z_1 + · · · + z_p) d^{p−1},  where z_j = |x_j − x̄_j|,    (66)

where the approximation is accurate if |z_j| ≪ 1. Dividing through by d^p, the proportion of the volume of the latter set is 1 − (1/d)‖x − x̄‖_1, which leads to the heuristic

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | ≈ 1 − c ‖x − x̄‖_1,  for some constant c.    (67)

The RHS has the correct scaling when x = x̄, i.e., the correlation equals one when ‖x − x̄‖_1 = 0. To maintain correct scaling at the other extreme with ‖x − x̄‖_1 = p, we should take c = 1/p, so that

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | = 1 − (1/p) Σ_{i=1}^p |x_i − x̄_i|.    (68)

Of course, this heuristic must be incorrect in that it does not depend on s; our theoretical results show that even for non-diametrically opposed points, the correlation drops to zero as s → ∞. Therefore, another recommendation is to use

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | = max( 1 − (s^ε/p) Σ_{i=1}^p |x_i − x̄_i|, 0 ),    (69)

for some ε > 0, where the dependence on s^ε comes from considering the decay of M(x, x̄) as x̄ moves away from x (c.f. the proof of Proposition 3).
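For practitioners, the rule of thumb (69) is simple enough to evaluate directly; the sketch below does so for a pair of test points, clipping the result to [0, 1]. The exponent eps is a user-chosen constant (the theory only guarantees the existence of some ε > 0), so the output should be treated as a rough conservative guess rather than a calibrated correlation estimate.

```python
import numpy as np

def correlation_heuristic(x, x_bar, s, eps=0.1):
    """Rule-of-thumb bound (69) on |Corr(RF(x), RF(x_bar))|:
    one minus a scaled L1 distance, clipped to the interval [0, 1]."""
    x, x_bar = np.asarray(x, float), np.asarray(x_bar, float)
    p = len(x)
    l1 = np.abs(x - x_bar).sum()
    return float(np.clip(1.0 - (s ** eps / p) * l1, 0.0, 1.0))

# The bound shrinks as the points separate and (slowly) as the subsample size grows.
for s in (100, 1000, 10000):
    print(s, correlation_heuristic([0.2, 0.5], [0.4, 0.5], s))
```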
In this section, we discuss the results of simulations calculating the correlation structure. For our experiment, we set p = 2, so that the covariates X are distributed on the unit square. The distribution of X is chosen to be "four-modal",

    X ∼ ¯N(µ_j, I)  with probability 1/4 for each j = 1, 2, 3, 4,    (70)

where the four modes µ_1, . . . , µ_4 lie in the interior of the unit square and ¯N denotes a truncated multivariate Gaussian distribution on the unit square.¹⁰ Thus, X has a bounded density on the unit square, with four peaks at µ_1, . . . , µ_4. The distribution of Y conditional on X = (x_1, x_2) is

    Y ∼ x_1 + x_2 + ε,  ε ∼ N(0, σ²),    (71)

with a small noise variance σ². The random splitting probability is δ = 1/2, and the regularity parameters are α = 0.01 and k = 1, so that the tree is grown to the fullest extent (i.e., terminal nodes may contain a single observation), with each terminal node lying on the 101 × 101 grid of the unit square. For each sample size n, five thousand trees are grown, and the estimates are aggregated to compute the correlation.

¹⁰ That is, ¯N(µ, Σ) denotes the conditional distribution of x ∼ N(µ, Σ) on the event x ∈ [0, 1]².

Figure 1 plots the correlation of estimates at x and x̄ as a function of the L¹ norm ‖x − x̄‖_1. The calculation is performed by first fixing x, then calculating the sample correlation (across five thousand trees) as x̄ ranges over each cell: the correlation is associated with the L¹ norm ‖x − x̄‖_1. This process is then repeated by varying the reference point x, and the correlation at ‖x − x̄‖_1 is the average of the correlations observed. The figure demonstrates that the linear heuristic (69) given in the previous section is conservative: it is evident that correlation decreases super-linearly as x and x̄ become separated.

Figure 1: Correlation as a function of sample size and L¹ norm.

Figure 2 plots the correlation on a logarithmic scale, which shows that the correlation decay is exponential in a neighborhood of unity. In other words, simulations suggest that the correct heuristic is of the shape

    | Cov(RF(x), RF(x̄)) / √(Var RF(x) · Var RF(x̄)) | ≈ e^{−λ ‖x − x̄‖_1}    (72)

for a suitable λ.

Figure 2: Logarithm of correlation as a function of sample size and L¹ norm.
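The data-generating process of the simulation can be reproduced in outline as follows; the mode locations, the component scale, and the noise level in the sketch are illustrative assumptions rather than the exact values used in the experiment, and the truncation to the unit square is handled by rejection sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
MODES = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])  # assumed mode locations

def draw_X(n, scale=1.0):
    """Draw from a four-modal mixture as in (70): pick a mode uniformly at random,
    then draw a Gaussian around it truncated to the unit square (by rejection)."""
    out = np.empty((n, 2))
    for i in range(n):
        while True:
            x = rng.normal(loc=MODES[rng.integers(4)], scale=scale)
            if np.all((x >= 0.0) & (x <= 1.0)):
                out[i] = x
                break
    return out

def draw_Y(X, sigma=0.1):
    """Response as in (71): Y = x_1 + x_2 + Gaussian noise (noise level assumed)."""
    return X[:, 0] + X[:, 1] + rng.normal(scale=sigma, size=len(X))

X = draw_X(1000)
Y = draw_Y(X)
```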
Random forests and tree-based methods form an important part of the applied data analyst's toolkit. In this paper, we study the covariance between random forest estimates at several points. We develop a novel construction of a directed acyclic graph that keeps track of the splitting probabilities when the value of one data point is known (Propositions 5 and 6). As part of the proof, we establish stability properties for a class of splitting rules (see Proposition 4). We also identify (Proposition 3) the quantity $M(x, \bar x)$, which (roughly) captures the likelihood of two points belonging to the same terminal node, as a key quantity controlling the off-diagonal terms of the covariance matrix of the multivariate random forest.

In this way, this paper provides a theoretical basis for performing inference on functionals of the target function (e.g., a heterogeneous treatment effect) when the functional is based on values of the target function at multiple points in the feature space. Specifically, we show that the covariance vanishes in the limit relative to the variance, and we provide heuristics on the size of the correlation in finite samples.

We close by discussing a couple of avenues for future research. The first is extending our framework to cover categorical or discrete-valued features. Here, new assumptions would be required in order to maintain the guarantee that node sizes are "not too small." Second, our bounds on the covariance matrix of the random forest, after potential improvements, may be used with the recent results of [16, 17] in order to provide finite-sample Gaussian approximations. This would provide a sounder theoretical underpinning for our heuristics, and increase the usefulness of this paper for practitioners.

A Proofs

Proof of Proposition 1. For random vectors in $\mathbb{R}^q$, define the inner product
$$\langle X, Y \rangle := E(X^\top M Y). \qquad (73)$$
For each subset $A \subseteq \{1, \ldots, n\}$, let $H_A$ be the set of square-integrable random vectors of the form $g(X_i : i \in A)$, where $g$ is a function of $|A|$ arguments, satisfying the condition that
$$E\big(g(X_i : i \in A) \mid \{X_i : i \in B\}\big) = 0 \qquad (74)$$
for all subsets $B \subsetneq A$. It is easy to see that the collections $H_A$ are pairwise orthogonal as $A$ ranges over the subsets of $\{1, \ldots, n\}$. By induction on $r = |A|$, the direct sum $\bigoplus_{B \subseteq A} H_B$ is equal to the set of all statistics which are functions of $\{X_i : i \in A\}$. In particular, $\bigoplus_A H_A$ is the set of all statistics based on $\{X_1, \ldots, X_n\}$. When the variables $\{X_1, \ldots, X_n\}$ are IID, $H_A$ depends only on $|A|$, in the sense that there exist collections of functions $H_1, \ldots, H_n$, where $H_k$ is a collection of $k$-ary functions, such that
$$H_A = \{ g(X_i : i \in A) : g \in H_{|A|} \}. \qquad (75)$$
The proof is complete by letting $f_k$ be the projection of $f$ onto $H_k$ according to the inner product given in (73).

Proof of Proposition 2. We will prove the slightly more general statement that if $A_n$ and $B_n$ are two sequences of square matrices with bounded entries such that $B_{ii} > \delta$ for some $\delta > 0$ and all $n$,
$$A_{ii} \ge B_{ii} \log n, \qquad (76)$$
and $A_{ij} = o(1/\log n)$ for $i \ne j$, then $\operatorname{tr}(A^{-1}B) \to 0$. To prove this, start with the determinant formula $\det A = \sum_\pi \operatorname{sgn}(\pi) \prod_{i=1}^q a_{i\pi_i}$, where the sum runs over permutations $\pi$ of $\{1, \ldots, q\}$ and $\operatorname{sgn}(\pi)$ is the sign of the permutation. Since the off-diagonal entries $A_{ij}$ are assumed to vanish relative to the $A_{ii}$, we have $|\det A| \sim \prod_i A_{ii}$, where the notation $a \sim b$ stands for $c|b| \le |a| \le c'|b|$ for constants $c$ and $c'$ not depending on $n$. Next, recall Cramer's rule,
$$(A^{-1})_{ii} = \det A_{-i} / \det A, \qquad (77)$$
where $A_{-i}$ is the matrix $A$ with its $i$-th row and $i$-th column removed. A similar argument shows that $|\det A_{-i}| \sim \prod_{j \ne i} A_{jj}$, whence
$$(A^{-1})_{ii} \sim 1/A_{ii}. \qquad (78)$$
In particular, the $i$-th diagonal entry of the matrix $A^{-1}B$ is given by
$$(A^{-1}B)_{ii} = (A^{-1})_{ii} B_{ii} + \sum_{j \ne i} (A^{-1})_{ij} B_{ji} \sim \frac{B_{ii}}{A_{ii}} \le \frac{1}{\log n}, \qquad (79)$$
where the final relation is due to the fact that $(A^{-1})_{ij}$ is itself a polynomial in the entries of $A$ (viz., an entry of the cofactor matrix of $A$) divided by the determinant. Therefore, the trace of $A^{-1}B$ is of order $1/\log n$, since the dimension $q \times q$ of each matrix is fixed. Using the subsample size $s = n^\beta$, so that $s/n = n^{-(1-\beta)}$, completes the proof.
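As a quick sanity check on the linear-algebra step above (not part of the paper), the snippet below builds matrices whose diagonal entries grow like $B_{ii}\log n$ and whose off-diagonal entries shrink like $1/(\log n)^2$, and verifies numerically that $\operatorname{tr}(A^{-1}B)$ is of order $q/\log n$; the dimension and the particular $o(1/\log n)$ rate are arbitrary choices.

```python
# Numerical illustration of tr(A^{-1} B) -> 0 when A_ii ~ B_ii * log(n) and A_ij = o(1/log n).
import numpy as np

q = 5
rng = np.random.default_rng(1)
B = np.eye(q) + 0.3 * rng.uniform(size=(q, q))        # bounded entries, B_ii > delta
for n in [10**2, 10**4, 10**8, 10**16]:
    logn = np.log(n)
    # Diagonal of order B_ii * log n, off-diagonal of order 1/(log n)^2.
    A = np.diag(np.diag(B) * logn) + (1.0 / logn**2) * rng.uniform(size=(q, q))
    trace = np.trace(np.linalg.solve(A, B))           # tr(A^{-1} B)
    print(f"log n = {logn:6.1f}:  tr(A^-1 B) = {trace:.4f}   (q/log n = {q/logn:.4f})")
```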
Proof of Proposition 3. Recall that the splitting algorithm has a probability $\delta$ chance of splitting on the $j$-th axis. Since each terminal node contains a constant number of points, the number of terminal nodes is equal (up to a constant) to the subsample size $s$. Therefore, the number of splits required to reach a terminal node is bounded (up to a constant) by $\log_2(s/K)$, where $K = 2k - 1$. Since $x \ne \bar x$, we have
$$0 < \|x - \bar x\|_\infty \le \|x - x_1\|_\infty + \|\bar x - x_1\|_\infty \qquad (80)$$
for all $x_1 \in \mathcal{X}$. In particular, given any $x_1$ there exist some $j \in \{1, \ldots, p\}$ and a constant $\beta$ for which either $|x_j - x_{1,j}| > \beta$ or $|\bar x_j - x_{1,j}| > \beta$. Without loss of generality, we may assume that the former case holds. Certainly, a necessary condition for $X_1 = x_1$ to belong to the same leaf node as $x$ (i.e., a necessary condition for $\{I = 1\}$) is that the length of the $j$-th axis of that leaf node be larger than $\beta$.

Let $c_j(x)$ denote the number of splits in coordinate $j$ along the sequence of splits leading to the terminal node containing $x$. By our randomization assumption, each split has at least an independent chance $\delta$ of being chosen at random, and since we cycle through each coordinate (c.f. Assumption 2),
$$c_j(x) \succeq \frac{1}{p}\, B\!\left(\log_2 \frac{s}{K},\ \delta\right), \qquad \text{where } \succeq \text{ stands for stochastic dominance.} \qquad (81)$$
Per Assumption 4, each split along the $j$-th axis decreases its length by a factor of at least $(1-\alpha)$. Since splitting begins in the unit hypercube,
$$(1-\alpha)^{c_j(x)} \ge \beta \implies c_j(x) \le \frac{\log \beta}{\log(1-\alpha)} =: \rho. \qquad (82)$$
Since $\{I = 1\}$ requires the length of the $j$-th axis to exceed $\beta$ (a constant), this proves
$$E(I \mid X_1 = x_1) \le P\!\left[ B\!\left(\log_2 \frac{s}{K},\ \delta\right) \le p\rho \right]. \qquad (83)$$
Since $p\rho$ is a constant, we may conclude
$$P\!\left[ B\!\left(\log_2 \frac{s}{K},\ \delta\right) \le p\rho \right] \le \big(1 - \delta + o(1)\big)^{\log_2(s/K)} = \left(\frac{1}{s}\right)^{\log_2 \frac{1}{1-\delta} + o(1)}. \qquad (84)$$
Finally, recall that the base of the logarithm is two since the tree is binary. Therefore, if we choose $\delta > 1/2$, the exponent exceeds 1 and the proof is complete.
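For completeness, the change of base behind (84), and the resulting requirement on $\delta$, can be spelled out as follows:
$$(1-\delta)^{\log_2(s/K)} = 2^{\log_2(1-\delta)\,\log_2(s/K)} = \left(\frac{s}{K}\right)^{\log_2(1-\delta)} = \left(\frac{K}{s}\right)^{\log_2\frac{1}{1-\delta}}, \qquad \log_2\frac{1}{1-\delta} > 1 \iff \delta > \frac{1}{2}.$$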
Proof of Proposition 4. The easiest case is the splitting decision in the root node $[0,1]^p$, so we start there. We prove the result by introducing a coupling between the split decisions with and without conditioning on $X_1 = x$:
$$S = \arg\max_i f_i\!\left( \frac{1}{s}\sum_{\ell=1}^s m_1(X_\ell), \ldots, \frac{1}{s}\sum_{\ell=1}^s m_Q(X_\ell) \right), \qquad S' = \arg\max_i f_i\!\left( \frac{1}{s}\Big[m_1(x) + \sum_{\ell=2}^s m_1(X_\ell)\Big], \ldots, \frac{1}{s}\Big[m_Q(x) + \sum_{\ell=2}^s m_Q(X_\ell)\Big] \right), \qquad (85)$$
where we abbreviate the evaluated criteria by $f_i$ and $f'_i$ respectively. Here, $S$ is the split made on the sample $X_1, \ldots, X_s$ and $S'$ is the split made on the sample conditional on $X_1 = x$. Note that we may assume without loss of generality that the splits are not randomly chosen, since on that event the splitting probabilities are trivially equal. Clearly, a necessary condition for $S \ne S'$ is the existence of a pair $1 \le i \ne j \le P$ for which
$$f_i > f_j \quad\text{but}\quad f'_j > f'_i. \qquad (86)$$
Since $f$ is Lipschitz and its arguments are sub-exponential by assumption, the quantities $f_i$ and $f_j$ concentrate around their respective limits $f_i(E m_1, \ldots, E m_Q)$ and $f_j(E m_1, \ldots, E m_Q)$; hence, whenever $f_i(E m_1, \ldots, E m_Q) > f_j(E m_1, \ldots, E m_Q)$ we will have
$$f_i - f_j > C/s \quad\text{with probability at least } 1 - O(e^{-cs}) \text{ for some constants } c, C. \qquad (87)$$
However, the corresponding arguments of $f_i$ in the two expressions of (85) differ by at most a term of order $1/s$, the difference being $|m_q(x) - m_q(X_1)|/s$ in the $q$-th argument. By Lipschitz continuity, a change of order $1/s$ in the arguments changes the function values by a proportional amount, so $f'_j > f'_i$ is impossible when $f_i - f_j > C/s$. It follows that the event $(f_i > f_j,\ f'_j > f'_i)$ occurs with probability at most $O(e^{-cs})$, and we finish by taking a union bound over the $\binom{P}{2}$ pairs $(i, j)$. This result for $[0,1]^p$ will be referred to as the base case.

Note that the above actually proves something stronger, namely that for every split $\tau$,
$$P(S \ne \tau \mid S' = \tau) < e^{-cs} \quad\text{and}\quad P(S' \ne \tau \mid S = \tau) < e^{-cs}. \qquad (88)$$
It follows that for any $\tau$, the total variation distance between $(X_1, \ldots, X_s \mid S = \tau)$ and $(X_1, \ldots, X_s \mid S' = \tau)$ is at most $2e^{-cs}$. To see this, note that $S$ and $S'$ are functions of the $X_i$ only, so that the densities of these two distributions are
$$p_\tau(x) = \frac{\mathbf{1}\{S(x) = \tau\}\, p(x)}{P(S = \tau)} \quad\text{and}\quad p'_\tau(x) = \frac{\mathbf{1}\{S'(x) = \tau\}\, p(x)}{P(S' = \tau)} \qquad (89)$$
respectively. We may assume without loss of generality that $P(S' = \tau) \ge P(S = \tau)$, so that the total variation is
$$\int |p_\tau(x) - p'_\tau(x)| = \int_{S = \tau} \big(p_\tau(x) - p'_\tau(x)\big) + \int_{S \ne \tau,\, S' = \tau} p'_\tau(x) = 1 - \frac{P(S' = \tau, S = \tau)}{P(S' = \tau)} + \frac{P(S' = \tau, S \ne \tau)}{P(S' = \tau)} = 2\, P(S \ne \tau \mid S' = \tau) < 2e^{-cs}. \qquad (90)$$
The upshot is that, when considering the splitting probability in the next node, we can ignore the difference in the distribution of $X_1, \ldots, X_s$ when conditioning on $S = \tau$ versus conditioning on $S' = \tau$, and pay a cost of $O(e^{-cs})$.

Now consider bounding the difference of the splitting probabilities at the next split,
$$P(S_2 = s_2 \mid S = \tau) - P(S_2 = s_2 \mid S = \tau, X_1 = x). \qquad (91)$$
Again, the strategy is to find a coupling $(S_2, S'_2)$ such that
$$S_2 \sim (S_2 \mid S = \tau) \quad\text{and}\quad S'_2 \sim (S_2 \mid S = \tau, X_1 = x) \qquad (92)$$
with $\mathrm{TV}(S_2, S'_2) \le e^{-cs|v|}$, $|v|$ being the volume of the hyperrectangle $v$ corresponding to one of the half-spaces produced by $\tau$. Since the distribution of $X_1, \ldots, X_s$ conditional on $S = \tau$ differs from its unconditional distribution by an amount $e^{-cs}$ in total variation distance, we may use the following coupling:
$$S_2 = \arg\max_i f_i\!\left( \frac{1}{n_v}\sum_{X_\ell \in v} m_1(X_\ell), \ldots, \frac{1}{n_v}\sum_{X_\ell \in v} m_Q(X_\ell) \right), \qquad S'_2 = \arg\max_i f_i\!\left( \frac{1}{n_v}\sum_{X'_\ell \in v} m_1(X'_\ell), \ldots, \frac{1}{n_v}\sum_{X'_\ell \in v} m_Q(X'_\ell) \right), \qquad (93)$$
where the $X_\ell$ follow the distribution of $(X_1, \ldots, X_s)$ conditional on $S = \tau$ and the $X'_\ell$ follow the distribution conditional on $S = \tau$ and $X_1 = x$. By conditioning on $S = \tau$ instead of $S' = \tau$, we increase the total variation by an amount $e^{-cs}$ via the triangle inequality.

Now the rest of the proof is the same as in the base case, noting that with high probability the number $n_v$ of points in $v$ is equal to $s|v|$ up to a multiplicative factor $(1-\eta)$, with probability at least $1 - e^{-cs\eta^2}$. The previous bounds are applied recursively at each depth $\ell$ of the DAG. At depth $\ell$, we incur an "approximation cost" from the total variation distance bounded by $e^{-c|v|s}$. Since $s|v| \ge \alpha^{\ell} s$, it follows that $\ell \le O(\log(s|v|))$, whence the cumulative cost at depth $\ell$ is $O(\log(s|v|) \cdot e^{-c|v|s})$. Putting everything together, we have proven that
$$\mathrm{TV}(p, p') \le O\big(\log(|v|s) \cdot e^{-c|v|s}\big) \le o\!\left( \left(\frac{1}{s|v|}\right)^{\epsilon} \right) \quad \text{for some } \epsilon > 0. \qquad (94)$$
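To build some intuition for the base-case coupling (85)–(88), the toy experiment below (an illustrative sketch, not the paper's construction) estimates how often a CART-style best split changes when the first sample point is pinned to a fixed value. The variance-reduction criterion, the step-function signal, and all constants are stand-in assumptions for the generic split functions $f_i$; the estimated disagreement probability should fall rapidly as $s$ grows, in line with the $O(e^{-cs})$ bound.

```python
# Toy estimate of P(S != S'): how often the best split changes when X_1 is pinned to x.
import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(0.1, 0.9, 9)          # the P candidate splits

def best_split(X, Y):
    """Index of the threshold maximizing the variance reduction (a stand-in criterion)."""
    scores = []
    for t in thresholds:
        left, right = Y[X <= t], Y[X > t]
        if len(left) == 0 or len(right) == 0:
            scores.append(-np.inf)
            continue
        scores.append(Y.var() - (len(left) * left.var() + len(right) * right.var()) / len(Y))
    return int(np.argmax(scores))

def disagreement_probability(s, reps=2000, x_pinned=0.9):
    disagree = 0
    for _ in range(reps):
        X = rng.uniform(size=s)
        Y = (X > 0.5).astype(float) + rng.normal(scale=0.2, size=s)   # sharp signal at 0.5
        S = best_split(X, Y)                        # split on the original sample
        Xc, Yc = X.copy(), Y.copy()
        Xc[0] = x_pinned                            # condition on X_1 = x
        Yc[0] = float(x_pinned > 0.5) + rng.normal(scale=0.2)
        disagree += (S != best_split(Xc, Yc))
    return disagree / reps

for s in [25, 50, 100, 200, 400]:
    print(f"s = {s:4d}:  estimated P(S != S') = {disagreement_probability(s):.4f}")
```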
Proof of Proposition 5. The claim is trivially true (by choosing an appropriate constant) if $v$ is a terminal node. Thus, fix a non-terminal node $v$ with $x_1 \notin v$ and let
$$\mathcal{X}_v = \{ X_i : X_i \in v \} \qquad (95)$$
denote the set of sample points landing in $v$, so that $k := |\mathcal{X}_v| \in \{1, \ldots, n-1\}$. Recall that $f = f(v)$ and $f' = f'(v)$ are the respective expectations of the tree estimator at $x$ when the sequence of splits is such that $v$ is the current cell of the feature space containing $x$, with $f'$ being calculated conditional on $X_1 = x_1$. It follows that $f$ and $f'$ are functions of the distribution of the "input vector" $\mathcal{X}_v$. In a slight abuse of notation, let $\Pi$ and $\Pi'$ denote the sequences of splits distributed according to the probabilities $p$ and $p'$. We will show that, for each $k \in \{1, \ldots, n-1\}$, the total variation distance between
$$(\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi = v) \quad\text{and}\quad (\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi' = v,\ X_1 = x_1) \qquad (96)$$
is bounded by $(\log s) \cdot \Delta(s|v|)$. This will suffice to bound $|f - f'|$ by the variational definition of the total variation distance,
$$\mathrm{TV}(p, p') = \sup_{|g| \le 1} \big| E_{A \sim p}\, g(A) - E_{A \sim p'}\, g(A) \big|. \qquad (97)$$
Since $x_1 \notin v$, $X_1$ is not an element of $\mathcal{X}_v$, so that $X_1$ and $\mathcal{X}_v$ are independent. Since the split sequence $\Pi'$ is distributed according to the splitting probabilities when $X_1 = x_1$, we have
$$\mathrm{dist}(\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi' = v,\ X_1 = x_1) = \mathrm{dist}(\mathcal{X}_v \mid |\mathcal{X}_v| = k,\ \Pi' = v). \qquad (98)$$
The depth of $v$ is at most $\log s$, so that $P(\Pi = \Pi') \ge 1 - (\log s)\Delta(s|v|)$ by applying the splitting stability assumption $\log s$ many times with a union bound. By (98), the total variation distance between the distributions in (96) is bounded by $P(\Pi \ne \Pi')$, and the result follows.

Proof of Proposition 6. The idea is to recursively expand the formulas $E\,T$ and $E(T \mid X_1)$ in terms of the directed acyclic graph. We start with
$$\big| E\,T - E(T \mid X_1 = x_1) \big| = \left| \sum_v p(e) f(e) - \sum_v p'(e) f'(e) \right| \le \sum_v |p(e) - p'(e)|\, f'(e) + \left| \sum_v p'(e)\big(f(e) - f'(e)\big) \right|, \qquad (99)$$
where the sum runs over the nodes $v$ reached after the first split, $e$ denoting the edge $[0,1]^p \to v$. The second summand may be split into two parts, one over the nodes with $x_1 \in v$ and the other over the nodes with $x_1 \notin v$. If splitting stability holds with function $\Delta$, then Proposition 5 allows us to bound the second part, so that
$$\big| E\,T - E(T \mid X_1 = x_1) \big| \le \sum_v |p(e) - p'(e)|\, f'(e) + \left| \sum_v p'(e)\big(f(e) - f'(e)\big) \right| \le \Delta(\alpha s) + (\log s)\Delta(\alpha s) + \sum_{v :\, x_1 \in v} p'(e)\, |f(e) - f'(e)|, \qquad (100)$$
where we used the fact that $|v| \ge \alpha$. Now, each of the terms $|f(e) - f'(e)|$ may in turn be bounded by $\Delta(\alpha^2 s) + (\log s)\Delta(\alpha^2 s) + \sum_{w :\, x_1 \in w} (\cdots)$. Continuing in this way, we obtain
$$\big| E\,T - E(T \mid X_1 = x_1) \big| \le \log(s)\big( \Delta(s) + p_1 \Delta(\alpha s) + p_2 \Delta(\alpha^2 s) + \cdots \big) = \log s \sum_{\ell=0}^{\infty} p_\ell\, \Delta(\alpha^{\ell} s), \qquad (101)$$
where $p_\ell$ is the probability that $x$ and $x_1$ belong to the same node after $\ell$ splits. In other words, if we let $L$ be the number of splits after which $x$ and $x_1$ are separated, then
$$p_\ell = P(L \ge \ell). \qquad (102)$$
Since $x_1 \ne x$, we may assume without loss of generality that $\|x - x_1\|_\infty > \beta$ for some fixed $\beta$ (c.f. the proof of Proposition 3). In particular,
$$p_\ell \le \big(1 - \delta + o(1)\big)^{\ell} \qquad (103)$$
for sufficiently large $\ell$. Moreover, $\Delta(\alpha^{\ell} s) = s^{-\epsilon}(\alpha^{-\epsilon})^{\ell}$, whence $\delta > 1 - \alpha^{\epsilon}$ is enough to ensure that the infinite series is of order $s^{-\epsilon}$. In particular, since $\alpha^{\epsilon} \ge \alpha$ for $\epsilon \in (0, 1]$, the restriction is satisfied (after suitable constants) by $\delta > 1 - \alpha$. This completes the proof.
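To spell out the final step: ignoring the $o(1)$ correction in (103) and the finitely many initial terms, the series in (101) is dominated by a geometric series, which converges precisely under the stated restriction on $\delta$:
$$\log s \sum_{\ell=0}^{\infty} p_\ell\, \Delta(\alpha^{\ell} s) \;\lesssim\; (\log s)\, s^{-\epsilon} \sum_{\ell=0}^{\infty} \big[(1-\delta)\alpha^{-\epsilon}\big]^{\ell} = \frac{(\log s)\, s^{-\epsilon}}{1 - (1-\delta)\alpha^{-\epsilon}}, \qquad (1-\delta)\alpha^{-\epsilon} < 1 \iff \delta > 1 - \alpha^{\epsilon}.$$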
References

[1] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3149–3157, Red Hook, NY, USA, 2017. Curran Associates Inc.
[3] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.
[4] Susan Athey, Julie Tibshirani, and Stefan Wager. The grf algorithm. https://github.com/grf-labs/grf/, 2017.
[5] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 6639–6649, Red Hook, NY, USA, 2018. Curran Associates Inc.
[6] Baptiste Gregorutti, Bertrand Michel, and Philippe Saint-Pierre. Correlation and variable importance in random forests. Statistics and Computing, 27(3):659–678, 2017.
[7] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008.
[8] Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236, 2010.
[9] Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
[10] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.
[11] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
[12] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
[13] Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. Annals of Statistics, 47(2):1148–1178, 2019.
[14] Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv: Statistics Theory, 2015.
[15] Nino Arsov, Martin Pavlovski, and Ljupco Kocarev. Stability of decision trees and logistic regression, 2019.
[16] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 45(4):2309–2352, 2017.
[17] Xiaohui Chen. Gaussian and bootstrap approximations for high-dimensional U-statistics and their applications. Annals of Statistics, 46(2):642–678, 2018.
[18] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.
[19] Google and other contributors. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[21] Rogier Quaedvlieg. Multi-horizon forecast comparison. Journal of Business & Economic Statistics, pages 1–14, 2019.
[22] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
[23] Patrick Billingsley.