Generalised Boosted Forests
Indrayudh Ghosal
Giles Hooker
Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, USA
March 4, 2021
Abstract
This paper extends recent work on boosting random forests to model non-Gaussian responses. Given an exponential family $E[Y|X] = g^{-1}(f(X))$ our goal is to obtain an estimate for $f$. We start with an MLE-type estimate in the link space and then define generalised residuals from it. We use these residuals and some corresponding weights to fit a base random forest, and then repeat the same procedure on the updated fit to obtain a boost random forest. We call the sum of these three estimators a generalised boosted forest. We show with simulated and real data that both random forest steps improve the test-set log-likelihood, which we treat as our primary metric. We also provide a variance estimator, which we can obtain with the same computational cost as the original estimate itself. Empirical experiments on real-world data and simulations demonstrate that the method can effectively reduce bias, and that confidence interval coverage is conservative in the bulk of the covariate distribution.

1 Introduction

This paper extends recent work on boosting random forests [Ghosal and Hooker, 2020] to model non-Gaussian responses. Ensembles of trees have become one of the most successful general-purpose machine learning methods, with the advantage of being both computationally efficient and having few tuning parameters that require human intervention [Breiman, 2001, Friedman, Hastie, and Tibshirani, 2001]. Ensembles are made of multiple weak learners – most commonly based on tree-structured models – that are combined by one of two broad strategies: bagging or boosting. Bagging methods, in particular random forests [Breiman, 2001], combine trees that are learned in parallel using the same process, usually differing in the data samples given to the trees and in any randomisation used in obtaining them. In contrast, boosting methods, such as gradient boosted forests [Friedman, 2001], obtain trees in sequence, each tree used to improve the prediction of the current ensemble.

Ghosal and Hooker, 2020 proposed combining bagging and boosting by boosting a random forest estimator: a second random forest was built to predict the out-of-bag residuals of the first random forest in a regression setting. This approach can be traced back to Breiman, 2001 and achieves a near-universal improvement in test-set error on real-world problems. A significant motivation in Ghosal and Hooker, 2020 was to extend recent results on uncertainty quantification for random forests to boosting-type methods. Mentch and Hooker, 2016 and Wager and Athey, 2017 provided confidence intervals for the predictions of random forests by employing a particular bagging structure, where each tree is given by a strict subsample of the dataset. This result is obtained by representing the ensemble as an infinite-order incomplete U-statistic for which a central limit theorem can be obtained, with the limiting variance calculated either by a direct representation [Mentch and Hooker, 2016] or the Infinitesimal Jackknife [Wager and Athey, 2017]; see also Zhou, Mentch, and Hooker, 2019. Ghosal and Hooker, 2020 extended this framework to represent the combination of sequentially-built random forests as a sum of two jointly-normal U-statistics, for which an extension of the Infinitesimal Jackknife could be applied.

This paper builds on the ideas in Ghosal and Hooker, 2020, extending them to the generalised regression setting where the responses follow an exponential family distribution.
We use binary responses for classification and count data in which the response follows a Poisson distribution as motivating examples. In the general exponential family we assume that the expected value of the response is related to a function of the predictor only through a suitable link function, i.e., $E[Y|X] = g^{-1}(f(X))$. When restricted to a linear model $f(x) = x^\top\beta$ this recovers the generalised linear model framework; here the linear model is replaced by the sum of sequentially-constructed random forests. We estimate this model via Newton-boosting for each random forest step, which is essentially a Newton-Raphson update in gradient boosting [Friedman, 2001]. This type of updating has been used in other iterative methods, for example word-embedding as in Baer, Seto, and Wells, 2018. We first obtain an optimal constant, and then fit two random forests in sequence, each time using a weighted fit to Newton residuals. This results in the same random forest process being used for each step; see §2 for details.

We also extend the variance estimates derived in Ghosal and Hooker, 2020 to this generalised regression model. This includes an extension of the Infinitesimal Jackknife estimate of covariance to account for the initial constant; see §3.1. In §3.2 we discuss some theoretical properties of our algorithm, namely its asymptotic normality and the consistency of the variance estimate under suitable regularity conditions. Finally, in §4 we empirically demonstrate the utility of our algorithm: §4.1 focuses on results from a simulated dataset and §4.2 demonstrates the performance of our algorithm on some datasets from the UCI database (Lichman, 2013). In both cases we measure performance mainly in terms of test-set log-likelihood, although other related metrics are also considered.

2 Generalised Boosted Forest

We consider the generalised model $E[Y|X] = g^{-1}(f(X))$ where $g$ is an appropriate link function; in this paper we consider the specific cases of the logit function if $Y$ is binary, or the log function if $Y$ is count data. Given data $(Z_i)_{i=1}^n = (Y_i, X_i)_{i=1}^n$ our goal is to construct an estimate $\hat f$ for $f$. Note that in the generalised linear model literature [McCullagh and Nelder, 1989] the estimate $\hat f$ is from the family of linear functions, but all of our calculations below apply to any family of functions. Throughout this paper we will refer to the estimate $\hat f$ and its properties as being in the "link" space, and we will refer to $g^{-1}(\hat f)$ as being in the "response" space.

We define the log-likelihood for the dataset $(Y_i, X_i)_{i=1}^n$ to be $\sum_{i=1}^n \ell(\eta_i; Y_i, X_i)$, where $\eta_i = f(X_i)$ is in the link space. For notational simplicity we will write the likelihood function for one point as simply $\ell_i(\eta_i)$, without explicit reference to the dependence on the data $(Y_i, X_i)$. To estimate $\hat f$, first we obtain an MLE-type initial value $\hat\eta^{(0)}_{MLE}$ given by
$$\hat\eta^{(0)}_{MLE} = \arg\max_t \sum_{i=1}^n \ell_i(t).$$
To improve upon this estimate we will define "residuals" and then fit a function $\hat f^{(1)}$ with these residuals as responses. In Ghosal and Hooker, 2020 we used the Gaussian regression setting with the model $Y = f(X) + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$. There the residuals are given by $r_i = Y_i - \hat f(X_i)$, where $\hat f$ is some estimator of $f$. To generalise the definition of residuals to the non-Gaussian setting we will first note the relationship between residuals and the log-likelihood in the Gaussian setting.
The log-likelihood in this case is given by $\sum_{i=1}^n \ell_i(\hat f(X_i)) = C - \sum_{i=1}^n \frac{(Y_i - \hat f(X_i))^2}{2\sigma^2}$, where $C$ is a constant. Then we see that $r_i \propto \ell'_i(\hat f(X_i))$.

Our goal now is to find a function $\hat f^{(1)}$ such that the log-likelihood improves from its previous value while also utilising some notion of "residuals" related to the derivative of the log-likelihood. So we must try to maximise $\sum_{i=1}^n \ell_i(\eta_i)$, i.e., minimise $\sum_{i=1}^n -\ell_i(\eta_i)$. Using a second-order Taylor expansion on $-\ell_i$ we note that
$$\sum_{i=1}^n -\ell_i(\eta_i) \approx \sum_{i=1}^n \left[-\ell_i(\hat\eta^{(0)}_i) - \ell'_i(\hat\eta^{(0)}_i)\cdot(\eta_i - \hat\eta^{(0)}_i) - \frac{1}{2}\ell''_i(\hat\eta^{(0)}_i)\cdot(\eta_i - \hat\eta^{(0)}_i)^2\right]$$
$$= \sum_{i=1}^n -\frac{1}{2}\ell''_i(\hat\eta^{(0)}_i)\left(\eta_i - \hat\eta^{(0)}_i + \frac{\ell'_i(\hat\eta^{(0)}_i)}{\ell''_i(\hat\eta^{(0)}_i)}\right)^2 + C,$$
where $C$ is some constant. We can then approximate $\xi_i = \eta_i - \hat\eta^{(0)}_i$ using a random forest trained with predictors $X_i$, training responses $\ell'_i(\hat\eta^{(0)}_i)/\big(-\ell''_i(\hat\eta^{(0)}_i)\big)$ (the weighted residuals) and training weights $-\ell''_i(\hat\eta^{(0)}_i)$. This sort of "weighted fit" is akin to a Newton-Raphson update. Denote this base random forest by $\hat f^{(1)}$. Here we note that:

• This calculation holds for any initial estimate $\hat f^{(0)}(X_i) = \hat\eta^{(0)}_i$, but for the purposes of this paper the initial estimate $\hat\eta^{(0)}_i = \hat\eta^{(0)}_{MLE}$ is constant over $i$.

• There may be different ways of implementing the training weights while fitting a random forest. This paper employs the ranger package in R, where weights are used as the sampling weights of the datapoints given to each tree. Alternative approaches might include utilising the weights in a weighted squared error loss function used as the splitting criterion in the construction of each tree. We will assume that these two procedures (and any procedure using reasonable re-weighting) are equivalent.

• In the Gaussian regression case the second derivatives $\ell''_i(\hat\eta^{(0)}_i)$ are constant. In the non-Gaussian case we could also make the same assumption and simply fit random forests with the residuals $\ell'_i(\hat\eta^{(0)}_i)$ as the response (with weights assumed to be 1 without loss of generality). Initial experiments (not shown) suggest that this procedure does not perform better than Newton weighting.

Once we have $\hat\xi^{(1)}_i = \hat f^{(1)}(X_i)$ we can also fit another random forest as a boosted step. Following similar calculations as above we will use predictors $X_i$, weighted residuals $\ell'_i(\hat\eta^{(0)}_i + \hat\xi^{(1)}_i)/\big(-\ell''_i(\hat\eta^{(0)}_i + \hat\xi^{(1)}_i)\big)$ as the training response and training weights $-\ell''_i(\hat\eta^{(0)}_i + \hat\xi^{(1)}_i)$. Denote this boosted-stage random forest by $\hat f^{(2)}$. The final predictor for a test point $x$ will be given by
$$\hat f(x) = \hat\eta^{(0)}_{MLE} + \hat f^{(1)}(x) + \hat f^{(2)}(x)$$
in the link space and by $g^{-1}(\hat f(x))$ in the response space. Similar to the Gaussian case discussed in Ghosal and Hooker, 2020 we could also continue this boosting for more than one step; however, we expect diminishing returns (and increasing computational burden) with further boosting steps as the size of the remaining signal decreases relative to the intrinsic variance in the data. A minimal code sketch of this two-stage fit is given below.
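As a concrete illustration, the following is a minimal R sketch of the two-stage Newton-residual fit for the Poisson family (log link), using ranger's case.weights as the per-tree sampling weights described above. The function and object names (fit_gbf_poisson, eta0, f1, f2) are ours for illustration, not the paper's reference implementation.

```r
# Minimal sketch: generalised boosted forest for a Poisson response (log link).
library(ranger)

fit_gbf_poisson <- function(X, y, num.trees = 1000, sample.fraction = 0.4) {
  df <- as.data.frame(X)
  # Stage 0: MLE-type constant in the link space; closed form is log(mean(y))
  eta0 <- log(mean(y))
  eta  <- rep(eta0, length(y))

  fit_stage <- function(eta) {
    mu  <- exp(eta)        # -l''_i(eta_i) = mu_i for the Poisson family
    dat <- df
    dat$r <- y / mu - 1    # Newton residuals l'_i / (-l''_i)
    ranger(r ~ ., data = dat, num.trees = num.trees,
           sample.fraction = sample.fraction, replace = FALSE,
           case.weights = mu,   # weights act as sampling probabilities per tree
           keep.inbag = TRUE)   # inbag counts are needed later for the IJ
  }

  f1  <- fit_stage(eta)                               # base forest
  eta <- eta + predict(f1, data = df)$predictions     # update link predictions
  f2  <- fit_stage(eta)                               # boosted forest

  list(eta0 = eta0, f1 = f1, f2 = f2)
}
```

For a binomial response the same skeleton applies with residuals $(y_i - n_i p_i)/(n_i p_i(1-p_i))$ and weights $n_i p_i(1-p_i)$; using the weights as sampling probabilities matches the ranger behaviour described in the bullet list above.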
3 Variance Estimation

In this section we derive a variance estimate that can be calculated at no additional computational cost. We also show that the variance estimate is consistent under some regularity conditions and motivate a central limit theorem. Finally, we end the section with an algorithmic description of the generalised boosted forest.
3.1 The Infinitesimal Jackknife
We use the Infinitesimal Jackknife (IJ) method to calculate the variance estimates of each stage and also for the final predictor (in the link space). In general, Efron, 1982 defines the Infinitesimal Jackknife for an estimate of the form $\hat\theta(P^0)$, where $P^0$ is the uniform probability distribution over the empirical dataset. First we consider the slightly more general estimate $\hat\theta(P^*)$, where $P^*$ is some probability distribution over the empirical dataset, i.e., a vector of the same length as the size of the dataset with positive elements that add to 1. This is then approximated by the hyperplane tangent to the surface $\hat\theta(P^*)$ at the point $P^* = P^0$, i.e., $\hat\theta(P^*) \approx \hat\theta_{TAN}(P^*) = \hat\theta(P^0) + (P^* - P^0)^\top U$, where $U$ is the vector of directional derivatives given by
$$U_i = \lim_{\epsilon\to 0}\frac{\hat\theta\big(P^0 + \epsilon(\delta_i - P^0)\big) - \hat\theta(P^0)}{\epsilon}, \quad i = 1, \dots, n.$$
Substituting the multinomial variability of $P^* - P^0$ we can obtain the variance of $\hat\theta_{TAN}(P^*)$ to be $\frac{1}{n^2}\sum_{i=1}^n U_i^2$. This is the IJ variance estimate for the estimator $\hat\theta(P^0)$.

In the link space
Borrowing the same notation as above, suppose $\hat\eta^{(0)}_{MLE} = \hat\theta(P^0)$, where $P^0$ is the uniform probability vector, and its directional derivatives are given by $U_i$. Then the IJ variance estimate of $\hat\eta^{(0)}_{MLE}$ will be given by $\widehat{\mathrm{var}}_{IJ}(\hat\eta^{(0)}_{MLE}) = \frac{1}{n^2}\sum_{i=1}^n U_i^2$. Wager, Hastie, and Efron, 2014 demonstrate that the directional derivatives for random forests are given by
$$U'_i = n \cdot \mathrm{cov}_b\big(N_{i,b}, T_b(x)\big),$$
where $N_{i,b}$ is the number of times the $i$th datapoint is used in training the $b$th tree, $T_b(x)$ is the prediction for a test point $x$ from the $b$th tree, and the covariance is calculated over all trees in the forest. Thus for our predictor $\hat f$ the IJ variance estimate is given by
$$\hat V(x) = \frac{1}{n^2}\sum_{i=1}^n\Big(U_i + n\cdot\mathrm{cov}_b(N^{(1)}_{i,b}, T^{(1)}_b(x)) + n\cdot\mathrm{cov}_b(N^{(2)}_{i,b}, T^{(2)}_b(x))\Big)^2 + \frac{1}{B}\Big(\mathrm{var}_b(T^{(1)}_b(x)) + \mathrm{var}_b(T^{(2)}_b(x))\Big) \quad (3.1)$$
where the second term accounts for the extra variability from the random forest being an incomplete U-statistic rather than a complete U-statistic [Ghosal and Hooker, 2020]. In appendix A we show that this variance estimate is consistent under some regularity assumptions detailed in §3.2. We would especially like to note one of the cross terms arising from the first term of the above expression, given by $\frac{2}{n}\sum_{i=1}^n U_i\cdot\mathrm{cov}_b(N^{(1)}_{i,b}, T^{(1)}_b(x))$, which is (twice) the IJ estimate of the covariance between the constant ($\hat\eta^{(0)}_{MLE}$) and the base-stage random forest ($\hat f^{(1)}$). Similar cross-product terms are estimates of covariance between $\hat\eta^{(0)}_{MLE}$ (and $\hat f^{(1)}$) and the boosted-stage random forest $\hat f^{(2)}$.
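The directional derivatives and the between-tree variance in (3.1) can be read off a fitted ranger object directly; here is a sketch for one forest at a single test point, assuming the forest was grown with keep.inbag = TRUE (the helper name ij_parts is ours).

```r
# Sketch: IJ ingredients for one forest at a test point xnew (a 1-row data frame).
ij_parts <- function(rf, xnew, n) {
  # B per-tree predictions T_b(x)
  Tb <- predict(rf, data = xnew, predict.all = TRUE)$predictions[1, ]
  # n x B matrix of inbag counts N_{i,b}
  N  <- do.call(cbind, rf$inbag.counts)
  # directional derivatives U'_i = n * cov_b(N_{i,b}, T_b(x))
  U  <- n * apply(N, 1, function(Ni) cov(Ni, Tb))
  list(U = U, varT = var(Tb))
}
```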
Bias correction

The IJ variance estimate is known to have an upward bias, and so we will employ a correction term before further exploration. Our bias correction is inspired by Zhou, Mentch, and Hooker, 2019, namely the final formula for the variance estimate of a U-statistic in appendix E of that paper. Note, though, that we do not need to use the multiplicative factor because we already account for the variance between the trees in the second term of our variance estimate above (3.1). Our final variance estimate is given by
$$\hat V(x) = \frac{1}{n^2}\sum_{i=1}^n\Big(U_i + n\cdot\mathrm{cov}_b(N^{(1)}_{i,b}, T^{(1)}_b(x)) + n\cdot\mathrm{cov}_b(N^{(2)}_{i,b}, T^{(2)}_b(x))\Big)^2 + \frac{1}{B}\cdot\Big(1 - \frac{n}{k}\Big)\cdot\Big(\mathrm{var}_b(T^{(1)}_b(x)) + \mathrm{var}_b(T^{(2)}_b(x))\Big) \quad (3.2)$$
where $k$ is the size of the subsamples (without replacement) for each tree in the random forests $\hat f^{(1)}$ and $\hat f^{(2)}$.
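Combining the three sets of directional derivatives then gives the final estimate. A sketch under our reading of (3.2), with U0 the derivatives of the MLE-type constant and p1, p2 the outputs of ij_parts above (function name ours):

```r
# Sketch: bias-corrected variance estimate (3.2) in the link space.
gbf_variance <- function(U0, p1, p2, n, B, k) {
  sum((U0 + p1$U + p2$U)^2) / n^2 +
    (1 / B) * (1 - n / k) * (p1$varT + p2$varT)  # Monte-Carlo correction term
}
```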
Examples

The closed form of $U_i$ will depend on the type of response being used. For example, we can calculate $U_i$ for the two cases below:

• For the binomial response family ($g^{-1}$ is the inverse-logit function) our model is $Y_i \sim \mathrm{Binomial}(n_i, p_i)$, where $p_i = g^{-1}(f(X_i))$ and the $n_i$ are given positive integers. We estimate $\hat\eta^{(0)}_i$ as follows ($C$ is a constant):
$$\ell_i(t) = y_i\log\left(\frac{e^t}{1+e^t}\right) + (n_i - y_i)\log\left(\frac{1}{1+e^t}\right) + C = y_i\big(t - \log(1+e^t)\big) - (n_i - y_i)\log(1+e^t) + C = y_i t - n_i\log(1+e^t) + C$$
$$\implies \ell'_i(t) = y_i - n_i\cdot\frac{e^t}{1+e^t}.$$
Setting the score to zero,
$$\sum_{i=1}^n \ell'_i(t) = 0 \implies \sum_{i=1}^n y_i - \frac{e^t}{1+e^t}\cdot\sum_{i=1}^n n_i = 0 \implies \hat\eta^{(0)}_i = t = \log\left(\frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n (n_i - y_i)}\right).$$
Thus $\hat\eta^{(0)}_i = \hat\theta(P^0)$, where $P^0$ is the uniform probability vector and $\hat\theta(P) = \log\left(\frac{Y^\top P}{(N-Y)^\top P}\right)$ for a general probability vector $P$. So
$$U_i = \lim_{\epsilon\to 0}\frac{\hat\theta(P^0 + \epsilon(\delta_i - P^0)) - \hat\theta(P^0)}{\epsilon}$$
$$= \lim_{\epsilon\to 0}\frac{\log\big(Y^\top(P^0 + \epsilon(\delta_i - P^0))\big) - \log(Y^\top P^0)}{\epsilon} - \lim_{\epsilon\to 0}\frac{\log\big((N-Y)^\top(P^0 + \epsilon(\delta_i - P^0))\big) - \log\big((N-Y)^\top P^0\big)}{\epsilon}$$
$$= \frac{Y^\top(\delta_i - P^0)}{Y^\top P^0} - \frac{(N-Y)^\top(\delta_i - P^0)}{(N-Y)^\top P^0} = \frac{y_i - \bar y}{\bar y} - \frac{(n_i - y_i) - (\bar n - \bar y)}{\bar n - \bar y} = \frac{\bar n y_i - n_i\bar y}{\bar y(\bar n - \bar y)}.$$

• In the case where $Y$ is count data we use a Poisson response family, with our model being $Y_i \sim \mathrm{Poisson}(\lambda_i)$, where $\lambda_i = \exp(f(X_i))$, and we get the following calculations for estimating $\hat\eta^{(0)}_i$ ($C$ is a constant):
$$\ell_i(t) = -e^t + y_i\log(e^t) + C = y_i t - e^t + C \implies \ell'_i(t) = y_i - e^t.$$
Then
$$\sum_{i=1}^n \ell'_i(t) = 0 \implies \sum_{i=1}^n (y_i - e^t) = 0 \implies \hat\eta^{(0)}_i = t = \log(\bar y).$$
So we have $\hat\eta^{(0)}_i = \hat\theta(P^0)$ with $\hat\theta(P) = \log(Y^\top P)$, and thus
$$U_i = \lim_{\epsilon\to 0}\frac{\hat\theta(P^0 + \epsilon(\delta_i - P^0)) - \hat\theta(P^0)}{\epsilon} = \lim_{\epsilon\to 0}\frac{\log\big(Y^\top P^0 + \epsilon\cdot Y^\top(\delta_i - P^0)\big) - \log(Y^\top P^0)}{\epsilon} = \frac{Y^\top(\delta_i - P^0)}{Y^\top P^0} = \frac{y_i - \bar y}{\bar y}.$$
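In code, these closed forms are one-liners (function names ours; for Bernoulli data take ntrials = 1):

```r
# Directional derivatives of the MLE-type constant, from the two examples above.
U0_binomial <- function(y, ntrials) {
  (mean(ntrials) * y - ntrials * mean(y)) / (mean(y) * (mean(ntrials) - mean(y)))
}
U0_poisson <- function(y) (y - mean(y)) / mean(y)
```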
In the response space

Note that in the response space (by transformation with $g^{-1}$) we can use the delta method to get the corresponding variance estimates. Using the delta method is valid under decreasing $\mathrm{var}(\hat f)$. Thus for a test point $x$, if $\hat f(x)$ is the prediction and $\hat V(x)$ is the IJ variance estimate in the link space [(3.2)], then in the response space the variance estimate will be $\hat V(x)\cdot\big((g^{-1})'(\hat f(x))\big)^2$.
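A sketch of this delta-method transformation for the two families used in this paper (helper name ours):

```r
# Response-space variance via the delta method: V(x) * ((g^{-1})'(fhat))^2.
delta_response_var <- function(V_link, fhat, family = c("poisson", "binomial")) {
  family <- match.arg(family)
  ginv_prime <- switch(family,
    poisson  = exp(fhat),                         # derivative of exp
    binomial = plogis(fhat) * (1 - plogis(fhat))  # derivative of the logistic
  )
  V_link * ginv_prime^2
}
```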
3.2 Theoretical Properties

We will now briefly explore some theoretical properties of our estimator and the variance estimate.

Regularity Condition
First we need a regularity condition very similar to Condition 1 in Ghosal and Hooker, 2020. Note that the residuals $\ell'_i(\hat\eta^{(0)}_i)/\big(-\ell''_i(\hat\eta^{(0)}_i)\big)$ are functions of $\hat\eta^{(0)}_i = \hat\eta^{(0)}_{MLE}$ and thus depend on the whole dataset. Each of the trees in the random forest $\hat f^{(1)}$ therefore depends on the whole dataset as well, and so it is not a valid U-statistic kernel; this makes $\hat f^{(1)}$ not a valid U-statistic. A similar argument holds for $\hat f^{(2)}$. However, if we instead defined the random forests based on "noise free" residuals such as $\ell'_i(E[\hat\eta^{(0)}_i])/\big(-\ell''_i(E[\hat\eta^{(0)}_i])\big)$ then the base random forest would be a U-statistic (to make the second (boost) random forest a U-statistic as well we need a slightly more modified version of $\eta^{(1)}_i$). We first make the following general assumption stating the asymptotic equivalence of these two types of random forests.

Condition 1.
For each of $j = 1, 2$, define $\hat f^{(j)}$ to be the predictor obtained when using the residuals $\ell'_i(\hat\eta^{(j-1)}_i)/\big(-\ell''_i(\hat\eta^{(j-1)}_i)\big)$ and weights $-\ell''_i(\hat\eta^{(j-1)}_i)$. We assume there are fixed functions $g^{(j-1)}$ such that, for $\check\eta^{(j-1)}_i = g^{(j-1)}(x_i)$, the forest $\check f^{(j)}(x)$ obtained when using residuals $\ell'_i(\check\eta^{(j-1)}_i)/\big(-\ell''_i(\check\eta^{(j-1)}_i)\big)$ and weights $-\ell''_i(\check\eta^{(j-1)}_i)$ satisfies
$$\frac{\hat f^{(j)}(x) - \check f^{(j)}(x)}{\sqrt{\mathrm{var}\big(\hat f^{(j)}(x)\big)}} \xrightarrow{p} 0.$$
This condition is slightly more general than we need. A natural choice for the functions $g^{(j-1)}$ could be $g^{(0)}(x) = E\big[\hat\eta^{(0)}_{MLE}\big]$ and $g^{(1)}(x) = E\big[\hat\eta^{(0)}_{MLE} + \hat f^{(1)}(x)\big]$. However, $g^{(1)}(x) = E\big[\hat\eta^{(0)}_{MLE} + \check f^{(1)}(x)\big]$ could also be used, so that $\hat\eta^{(0)}_{MLE} + \check f^{(1)}(x) + \check f^{(2)}(x)$ represents a boosting analogue that is exactly a sum of U-statistics. Our main requirement is that the generalised boosted forest be approximated to within its standard error by functions that represent a sum of U-statistics.

In the rest of this section we will state and prove our theoretical results for the estimator $\check f(x) = \hat\eta^{(0)}_{MLE} + \check f^{(1)}(x) + \check f^{(2)}(x)$. A discussion regarding this condition can be found in Ghosal and Hooker, 2020.

Asymptotic Normality and Variance Consistency
Suppose the random forests $\hat f^{(1)}$ and $\hat f^{(2)}$ both have $B_n$ trees and each tree $T(x; Z_I)$ is constructed with a subsample $I$ of size $k_n$ (without replacement). Define the variance of this tree kernel to be $\zeta_{k_n,k_n}$; we also know that the variance of the complete U-statistic with $T$ as the kernel is given by $\frac{k_n^2}{n}\zeta_{1,k_n}$ [Hoeffding, 1948; Lee, 1990]. Without loss of generality we can also assume $E\,T(x; Z_1, \dots, Z_{k_n}) = 0$.

Finally, we assume the following Lindeberg-Feller type condition [due to Mentch and Hooker, 2016].

Condition 2.
Suppose that the dataset $Z_1, Z_2, \dots \overset{iid}{\sim} D_Z$ and let $T(Z_1, \dots, Z_{k_n})$ be the tree kernel of a random forest based on a subsample of size $k_n$. Define $T_{1,k_n}(z) = E\,T(z, Z_2, \dots, Z_{k_n})$. We assume that for all $\delta > 0$
$$\lim_{n\to\infty} \frac{1}{\zeta_{1,k_n}}\, E\Big[T_{1,k_n}^2(Z_1)\,\mathbb{1}\big\{|T_{1,k_n}(Z_1)| > \delta\sqrt{n\,\zeta_{1,k_n}}\big\}\Big] = 0.$$
Theorem 1.
Assume that the dataset $Z_1, Z_2, \dots \overset{iid}{\sim} D_Z$ and that $E\,T^2(x; Z_1, \dots, Z_{k_n}) \le C < \infty$ for all $x$, $n$ and some constant $C$. If $k_n, B_n \to \infty$ such that $\frac{k_n}{n} \to 0$ and $\frac{n}{B_n} \to 0$ as $n \to \infty$, and also $\lim_{n\to\infty} \frac{k_n\,\zeta_{1,k_n}}{\zeta_{k_n,k_n}} \neq 0$, then we have
$$\frac{\hat f(x)}{\sigma_n(x)} \xrightarrow{D} N(0, 1)$$
for some sequence $\sigma_n(x)$. Further, $\sigma_n^2(x)$ is consistently estimated by $\hat V(x)$ as given by (3.1).

Our assumptions and conditions for this theorem are almost exactly the same as for Theorem 1 in Ghosal and Hooker, 2020, and so we refer to that paper for the proof of asymptotic normality. The proof of the variance consistency part can be found in appendix A.

Also note that the above result is for the link-space estimate and variance estimate. But we can easily observe that $\sigma_n(x) \to 0$, and so we can use an appropriate second-order Taylor expansion and a delta-method type argument to prove asymptotic normality of the response-space estimate (mentioned at the end of §2) and consistency of the response-space variance estimate (mentioned at the end of §3.1).

3.3 Algorithm

Algorithm 1:
Generalised Boosted Forest (and its variance estimate)

Input:
The data $\big(Z_i = (Y_i, X_i)\big)_{i=1}^n$, the link function $g^{-1}$, the log-likelihood function $\ell$, the tree function $T$, the number of trees in both forests $B$, the subsample size for each tree $k$, and the test point $x$.

Obtain:
The MLE-type constant $\hat\eta^{(0)}_{MLE} = \arg\max_t \sum_{i=1}^n \ell(t; Z_i)$. This may have a closed form or may be computed numerically.

Calculate:
The directional derivatives of the MLE-type constant, $U^{(0)}_i$.
Calculate weighted residuals $r^{(1)}_i = \frac{\ell'_i(\hat\eta^{(0)}_{MLE})}{-\ell''_i(\hat\eta^{(0)}_{MLE})}$ and training weights $w^{(1)}_i = -\ell''_i(\hat\eta^{(0)}_{MLE})$.
for b = 1 to B do
    Choose a subset $I$ of size $k$ from $[n]$ such that the probability of the $i$th datapoint being included is proportional to $w^{(1)}_i$.
    Calculate: $T^{(1)}_b$ as the tree trained with the data $r^{(1)}_I$, and $N^{(1)}_{i,b} = \mathbb{1}\{i \in I\}$.
end
Obtain: The first (base) random forest $\hat f^{(1)}(x) = \frac{1}{B}\sum_{b=1}^B T^{(1)}_b(x)$.

Calculate:
The directional derivatives of $\hat f^{(1)}(x)$ as $U^{(1)}_i(x) = n\cdot\mathrm{cov}_b(N^{(1)}_{i,b}, T^{(1)}_b(x))$.
Calculate weighted residuals $r^{(2)}_i = \frac{\ell'_i(\hat\eta^{(0)}_{MLE} + \hat\eta^{(1)}_i)}{-\ell''_i(\hat\eta^{(0)}_{MLE} + \hat\eta^{(1)}_i)}$ and training weights $w^{(2)}_i = -\ell''_i(\hat\eta^{(0)}_{MLE} + \hat\eta^{(1)}_i)$, where $\hat\eta^{(1)}_i = \hat f^{(1)}(X_i)$.
for b = 1 to B do
    Choose a subset $I$ of size $k$ from $[n]$ such that the probability of the $i$th datapoint being included is proportional to $w^{(2)}_i$.
    Calculate: $T^{(2)}_b$ as the tree trained with the data $r^{(2)}_I$, and $N^{(2)}_{i,b} = \mathbb{1}\{i \in I\}$.
end
Obtain: The second (boosted) random forest $\hat f^{(2)}(x) = \frac{1}{B}\sum_{b=1}^B T^{(2)}_b(x)$.

Calculate:
The directional derivatives of $\hat f^{(2)}(x)$ as $U^{(2)}_i(x) = n\cdot\mathrm{cov}_b(N^{(2)}_{i,b}, T^{(2)}_b(x))$.

Output:
The generalised boosted forest estimate in the link space at the test point $x$, given by $\hat f(x) = \hat\eta^{(0)}_{MLE} + \hat f^{(1)}(x) + \hat f^{(2)}(x)$, and its (bias-corrected) variance estimate, given by
$$\hat V(x) = \frac{1}{n^2}\sum_{i=1}^n\Big(U^{(0)}_i + U^{(1)}_i + U^{(2)}_i\Big)^2 + \frac{1}{B}\cdot\Big(1 - \frac{n}{k}\Big)\cdot\Big(\mathrm{var}_b(T^{(1)}_b(x)) + \mathrm{var}_b(T^{(2)}_b(x))\Big).$$

Output:
The generalised boosted forest estimate in the response space at the test point $x$, given by $g^{-1}(\hat f(x))$, and its variance estimate, given by $\hat V(x)\cdot\big((g^{-1})'(\hat f(x))\big)^2$.

4 Empirical Studies

4.1 Simulations

We use the random forest implementation in the ranger package in R [Wright and Ziegler, 2017]. In this package the weights (the negated second derivatives of the log-likelihood function) are used as probabilities for the corresponding datapoints being selected into the training set of a particular tree. We use a signal given by $f(X) = \sum_{i=1}^m X_i$, where $X \sim U([-1,1]^m)$.

Our simulation runs for 200 replicates; in each of them we generate a dataset of size $n = 1000$ and train a generalised boosted forest with varying levels of subsample sizes (20%, 40%, 60% or 80% of the size of the training data) and $B = 1000$ trees for each of the two random forests. For each replication we also use a separate fixed set of 100 test points from $U([-1,1]^m)$ and then append 5 more fixed points $p_1, \dots, p_5$ to them, at increasing distances from the origin and with $p_4 = 2p_3$ and $p_5 = 3p_3$.

Experiments in Ghosal and Hooker, 2020 indicate that the signal-to-noise ratio affects the performance of boosted forests (referred to as the Gaussian case above), since if the noise in the dataset increases so does that in the estimate, and the benefit from boosting decreases. The same is true for the generalised boosted forest. We have therefore experimented with different combinations of signal-to-noise ratios as given below; a sketch of the data-generating setup appears after the list of performance metrics.
• Binomial family:
One way to improve the signal was to have the success probabilities far away from each other (since the noise is given by $p(1-p)$ and is minimised at $p = 0$ or $1$). For this purpose we multiplied the signal $f(X)$ by a scaling factor $s \in \{1, 2, 4, 8, 16\}$ before applying the logistic function to transform into probabilities. Thus the probabilities are pushed towards 0 and 1 as the scale increases. We can also achieve higher signal by having a higher number of trials ($n_i$); for this purpose we generated $n_i$ uniformly from the set $[M] = \{1, 2, \dots, M\}$, where $M$ was varied within the set $\{1, 2, 4, 8, 16\}$. Once we have the values of $n_i$, we generate $Y_i \sim \mathrm{Binomial}(n_i, p_i)$, where $p_i = g^{-1}(s\cdot f(X_i))$ for an appropriate scale $s$, with $g^{-1}$ being the logistic function.

• Poisson family: If $Y_i \sim \mathrm{Poisson}(\lambda_i)$ then $\mathrm{var}(E(Y_i))/\mathrm{mean}(\mathrm{var}(Y_i)) = \mathrm{var}(\lambda_i)/\mathrm{mean}(\lambda_i)$ is the signal-to-noise ratio. So if we multiply all $\lambda_i$ values by a scaling factor then the signal-to-noise ratio increases. Thus we set $Y_i \sim \mathrm{Poisson}(\lambda_i)$, where $\lambda_i = g^{-1}(f(X_i) + \log(s)) = s\cdot g^{-1}(f(X_i))$, with $s \in \{1, 2, 4, 8, 16\}$ and $g^{-1}$ the exponential function.

For each of these settings we obtain predictions and variance estimates in the link space, which we can also transform into their corresponding values in the response space. We use the following metrics to measure our performance:

1. The improvement in test-set log-likelihood from the MLE-type stage to fitting the base forest, and also from fitting the base forest to the boosted stage. In the leftmost panel of Figure 1 we show the binomial case only for the value of $M = 4$, since all the other numbers of trials give plots with a very similar shape (when suitably scaled). Note that the y-axes of the plots are in the pseudo-log scale (inverse hyperbolic sine).

2. We can define MSE in two ways: by $\frac{1}{n}\sum\big(\hat f(X_i) - f(X_i)\big)^2$ in the link space and by $\frac{1}{n}\sum\big(g^{-1}(\hat f(X_i)) - g^{-1}(f(X_i))\big)^2$ in the response space. Consider the three different estimators $\hat f$ corresponding to the MLE-type, base random forest and final boosted estimator, and the improvement in MSE that can be calculated between these stages in the same way as for the log-likelihood above. In the second and third panels of Figure 1 we have also restricted the binomial family to the case $M = 4$ (the other values of $M$ show plots with similar shape and scale). Here too the y-axes of the plots are in the pseudo-log scale (inverse hyperbolic sine).

Note that each point in these plots is a summary of 200 numbers, and for the Poisson family with high subsample fractions and high scales an unusual situation may arise. In this case the response-space MSE may increase between the base and first boosting stage; although this doesn't always happen, the amount of increase has a very heavy right tail, and the base-to-boost stage almost always corrects for this increase in MSE. For this reason, we show the median MSE improvement in this plot rather than the mean. Note that this happens in spite of the link space showing decreases in MSE between the different stages in all cases: the response-space estimate is an exponentiation of the link-space estimate and has much more variability, i.e., outliers play a larger role in skewing MSE. We have included a more detailed explanation of this phenomenon in appendix C.

3. Confidence interval coverage, as the average number of times (out of the 200 replicates) the real signal is contained within the 95% confidence interval.
We plot the coverage as a function of the true value of the signal (shown in the pseudo-log scale) in Figure 2. We restrict ourselves to the link space, a few representative families and scales, and a sample fraction of 0.4 (since we have observed that coverage doesn't really depend on the sample fraction).
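As a concrete instance of this design, one Poisson replicate can be generated as follows. This is a sketch under our reading of the setup above; the covariate dimension m is left as a parameter since its value is not fixed here.

```r
# Sketch of one simulation replicate for the Poisson family (§4.1 setup).
simulate_poisson <- function(n = 1000, m = 5, scale = 4) {
  X <- matrix(runif(n * m, -1, 1), n, m)   # covariates from U([-1,1]^m)
  f <- rowSums(X)                          # link-space signal
  y <- rpois(n, lambda = scale * exp(f))   # lambda multiplied by the scale factor
  list(X = X, y = y, f = f)
}
```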
Figure 1: Improvements in log-likelihood and MSE (link and response spaces) in the pseudo-log scale. Panels show the likelihood improvement, the mean MSE improvement in the link space, and the median MSE improvement in the response space, for the binomial ($M = 4$) and Poisson families against sample fraction; scales 1, 2, 4, 8, 16; stage differences shown are Base RF − MLE and Boost − Base RF.
Figure 2: Coverage of the 95% confidence intervals against the true signal (link space) for the binomial ($M = 1, 4, 16$) and Poisson families across scales; stages shown are Base RF and Boost RF.

We see that the coverage always improves with the second (boosted) random forest. For the binomial family there seems to be a threshold beyond which the confidence interval doesn't cover the true signal at all, and this threshold is independent of both ways of trying to improve the signal-to-noise ratio. A similar phenomenon is apparent for the Poisson family, although here the threshold depends on the scale. Possible reasons for this behaviour are discussed in appendix B. More detailed plots and discussion using further metrics can be found in appendix D.1.

For the 5 fixed points $p_1, \dots, p_5$ we can also find improvements in expected values of log-likelihood and MSE, but the effects won't be very pronounced for individual points. Instead we will look at the following metrics:

1. Absolute bias averaged over the 200 replicates. In Figure 3 we again focus only on $M = 4$ for the binomial family (the other values of $M$ show plots with similar shape and scale). Note that, for similar reasons as for MSE above, the average absolute bias can be calculated in the link (top two panels) and response (bottom two panels) spaces.

We can see that in the link space and for the binomial family (top left of Figure 3) the absolute bias doesn't depend on the subsample fraction but increases with scale and with the distance of the test point from the origin. For the other three cases the relationship with scale and distance from the origin is similar, although increasing subsample fractions generally decreases the absolute bias. Most importantly, in all cases we can see that boosting reduces absolute bias.

2. Consistency of the variance estimate (bias-corrected), calculated by taking the ratio of the average of the variance estimates over the variance of the estimates (over the 200 replicates). In Figure 4 we restrict the binomial family to $M = 4$ since the other values of $M$ have plots with similar shape and scale. This plot is only for the link space, although the plot for the response space is very similar.

Here we see that the variance is more consistent for higher subsample fractions. For the binomial family variance consistency is slightly better for points nearer to the origin, although the opposite effect is observed for the Poisson family. The effect of scale on variance consistency is difficult to ascertain. Lastly, the effect of the boosting step on this metric seems to depend on the test point and the family: for test points farther from the origin this metric becomes better with boosting for the binomial family, but worse for the Poisson family.

3. Kolmogorov-Smirnov statistic for normality. This can also be calculated for the link and response spaces (i.e., before and after the delta transformation). The values of the K-S statistic in the link space fall in the range 0.1-0.3; it doesn't change much across the different points, but it increases slightly with subsample fraction and scale $s$, and it is also slightly lower for the boosted stage than for the base forest stage. A detailed plot for this metric can be found in appendix D.1.

See appendix D.1 for more detailed plots and discussion with other performance metrics.

Figure 3: Absolute bias for the 5 fixed points; link space on top, response space on bottom, for the binomial ($M = 4$) and Poisson families against sample fraction.
4.2 Real Data

We present below results for some real datasets from the UCI database. Given the data $(Y_i, X_i)_{i=1}^n$ we employed 10-fold cross-validation to get predictions $\hat f(X_i)$ and variance estimates $\hat V_i$ in the link space for all three stages. For performance metrics we use the log-likelihood (defined similarly to the above) along with the MSE, the average variance and prediction intervals (in the response space). The last three are given by
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n \big(Y_i - g^{-1}(\hat f(X_i))\big)^2, \qquad \mathrm{AvgVar} = \frac{1}{n}\sum_{i=1}^n \hat V^{(R)}_i = \frac{1}{n}\sum_{i=1}^n \hat V_i\cdot\big((g^{-1})'(\hat f(X_i))\big)^2,$$
and the prediction interval for $Y_i$, i.e.,
$$P_i = \left(g^{-1}(\hat f(X_i)) \pm \Phi^{-1}(0.975)\sqrt{\hat V^{(R)}_i + \hat V^{(R)}_e}\right),$$
where $g^{-1}$ is the appropriate link function and $\hat V^{(R)}_e$ is the error variance (in the response space), estimated by $\hat V^{(R)}_e = \mathrm{MSE}$. Once we have the prediction intervals we calculate the prediction coverage (PC) to be $\frac{1}{n}\sum_{i=1}^n \mathbb{1}\{Y_i \in P_i\}$; a code sketch of this computation is given below.
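A minimal sketch of the coverage computation, assuming mu_hat holds the response-space predictions and V_resp the response-space variance estimates (names ours):

```r
# Prediction intervals P_i and prediction coverage (PC), per the formulas above.
prediction_coverage <- function(y, mu_hat, V_resp) {
  mse  <- mean((y - mu_hat)^2)                   # estimate of the error variance
  half <- qnorm(0.975) * sqrt(V_resp + mse)      # half-width of each interval
  mean(y >= mu_hat - half & y <= mu_hat + half)  # proportion of Y_i inside P_i
}
```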
We've used the Abalone and Solar flare datasets to fit the Poisson family, with results reported in Table 1. We can see that the boosting stages improve all the measures for the
Abalone data (except for the average variance). But in the
Solar flare data the response takes the value 0 a lot of the time and only occasionally has values between 1 and 8; that is why the first-stage random forest fails to improve the log-likelihood, but further boosting helps in this case.

Table 1: Performance statistics for the
Abalone and Solar Flare datasets.

Abalone
stage    MSE      Avg Var   PC       LL
stage0   10.413   0.003     0.9494   12.8729
stage1   4.688    3.583     0.9825   13.1637
stage2   4.549    5.848     0.9882   13.1745

Solar flare
stage    MSE      Avg Var   PC       LL
stage0   0.703    0.001     0.9340   -0.6659
stage1   5.195    20.312    0.9972   -0.8382
stage2   1.768    6.661     0.9849   -0.6526

We've also used the
Spam dataset for the binomial (Bernoulli) family, where our results are given in Table 2. Here boosting improves both the MSE and the log-likelihood. It also makes the coverage of a 95% prediction interval less conservative.

Table 2: Performance statistics for the
Spam dataset.

Spam
stage    MSE     Avg Var   PC       LL
stage0   0.239   0.00006   1.0000   -0.6709
stage1   0.059   0.02725   0.9813   -0.2484
stage2   0.040   0.03162   0.9837   -0.1648

Further note that the log-likelihood using the probability predictions from the ranger package in R is found to be -0.1957, which is less than the final log-likelihood given by our method, i.e., our method fits the dataset better than traditional classification forests on the log-likelihood metric.
Generalised Boosted Forest , to fit responses which can be modelled by anexponential family. With the goal of maximising the training set log-likelihood we start with an MLE-typeestimator in the link space (as opposed to the actual MLE which is in the response space). Then with asecond order Taylor series approximation to the log-likelihood we showed that a random forest, fitted withpseudo-residuals and appropriate weights, can improve this log-likelihood. Then, similar to Ghosal andHooker, 2020, we add another random forest as a boosting step which further improves the log-likelihood.Our method also uses the Infinitesimal Jackknife to get variance estimates, at no additional computationalcost.Through simulations and real data examples we’ve shown the effectiveness of our algorithm. The randomforests always improve the test-set log-likelihood in all of the cases we’ve tested. We also see noticeableimprovement in test-set MSE. Our estimates for uncertainty quantification shows conservative coverage ofthe resulting confidence/prediction intervals and the near-Gaussianity of the distribution of predictions.Although outside the scope of this paper, the method could be improved in many ways. We could of coursetry more boosting steps; the boosting steps could also have a multiplier determined either by hyper-parametertuning or with Newton-Raphson. Also note that we have not used over-dispersed models for variance, butobtaining an overdispersion parameter may be important in providing appropriate confidence intervals insome cases. We may also wish to consider alternative base-stage models such as a generalised linear modelrather than a straightforward constant. In this case, performance improvement from boosting this modelwith random forests can also be thought of as a goodness of fit test for the base model.
Acknowledgements
We are indebted to Benjamin Baer for the idea of Newton boosting updates.

References

[Hoe48] Wassily Hoeffding. "A class of statistics with asymptotically normal distribution". In: The Annals of Mathematical Statistics (1948), pp. 293–325.
[Efr82] Bradley Efron. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, 1982.
[MN89] P. McCullagh and J.A. Nelder. Generalized Linear Models, Second Edition. Chapman & Hall, 1989.
[Lee90] Alan J. Lee. U-statistics: Theory and Practice. Statistics: Textbooks and Monographs, 110. 1990.
[Bre01] Leo Breiman. "Random forests". In: Machine Learning 45.1 (2001), pp. 5–32.
[FHT01] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics, New York, 2001.
[Fri01] Jerome H. Friedman. "Greedy function approximation: a gradient boosting machine". In: Annals of Statistics (2001), pp. 1189–1232.
[Lic13] Moshe Lichman. UCI Machine Learning Repository. 2013. url: http://archive.ics.uci.edu/ml.
[WHE14] Stefan Wager, Trevor Hastie, and Bradley Efron. "Confidence intervals for random forests: the jackknife and the infinitesimal jackknife". In: Journal of Machine Learning Research 15 (2014).
[MH16] Lucas Mentch and Giles Hooker. "Quantifying uncertainty in random forests via confidence intervals and hypothesis tests". In: The Journal of Machine Learning Research 17.1 (2016), pp. 841–881.
[WA17] Stefan Wager and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests". In: Journal of the American Statistical Association, just-accepted (2017).
[WZ17] Marvin N. Wright and Andreas Ziegler. "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R". In: Journal of Statistical Software 77.1 (2017), pp. 1–17.
[BSW18] Benjamin R. Baer, Skyler Seto, and Martin T. Wells. "Exponential family word embeddings: An iterative approach for learning word vectors". In: NIPS Workshop: IRASL. 2018.
[ZMH19] Zhengze Zhou, Lucas Mentch, and Giles Hooker. "V-statistics and Variance Estimation". In: arXiv preprint arXiv:1912.01089 (2019).
[GH20] Indrayudh Ghosal and Giles Hooker. "Boosting random forests to reduce bias; one-step boosted forest and its variance estimate". In: Journal of Computational and Graphical Statistics (2020), pp. 1–10.
A Consistency of the Infinitesimal Jackknife
As stated in Theorem 1 we will prove consistency of the variance estimate in (3.1).
Proof:
Suppose the MLE-type estimator in the link space is given by $f(\bar y)$, where $f$ has a continuous derivative. By the classical central limit theorem $\bar y$ is approximately $N(\mu, \frac{\sigma^2}{n})$, and so by the delta method $f(\bar y)$ is approximately $N\big(f(\mu), \frac{\sigma^2}{n}\cdot(f'(\mu))^2\big)$. Also, it is easily seen that the IJ directional derivatives for $f(\bar y)$ are $(U_i)_{i=1}^n$, where $U_i = f'(\bar y)(y_i - \bar y)$. So the IJ estimate for the variance of $f(\bar y)$ is
$$\frac{1}{n^2}\sum_{i=1}^n U_i^2 = \frac{1}{n^2}\sum_{i=1}^n (f'(\bar y))^2\cdot(y_i - \bar y)^2 = \frac{s^2}{n}\cdot(f'(\bar y))^2.$$
Since $\bar y$ and $s^2$ are asymptotically independent, $E\big[\frac{s^2}{n}\cdot(f'(\bar y))^2\big] \approx \frac{\sigma^2}{n}\cdot(f'(\mu))^2$.

Now note that the IJ variance estimates for the two random forests, and also the IJ covariance estimate between them, have been shown to be consistent in Ghosal and Hooker, 2020. We only need to show consistency of the IJ covariance estimate between the MLE-type estimator $f(\bar y)$ and the random forests. The following argument shows this for the base random forest, and a similar process works for the boosting step as well; note that by Condition 1 the noise from $f(\bar y)$ doesn't affect the base random forest or its directional derivatives.

We know from Ghosal and Hooker, 2020 that the IJ directional derivatives for a random forest are given by $U'_i = n\cdot\mathrm{cov}_b(N_{i,b}, T_b(x))$, where $N_{i,b}$ is the number of times the $i$th training point is included in the sample for the $b$th tree and $T_b$ is the prediction from the $b$th tree for the test point $x$. The IJ variance estimate $\hat\sigma^2_F = \frac{1}{n^2}\sum_{i=1}^n (U'_i)^2$ is consistent for $\sigma^2_F$, the theoretical variance of the first Hajek projection of the random forest $\hat F(x)$, i.e.,
$$\sigma^2_F = \sum_{i=1}^n\Big[E_{U\sim D}\big(\hat F(x)\,\big|\,U_1 = Z_i\big) - E_{U\sim D}\big(\hat F(x)\big)\Big]^2 = \frac{k^2}{n^2}\sum_{i=1}^n\Big[E_{U\sim D}\big(T(x)\,\big|\,U_1 = Z_i\big) - E_{U\sim D}\big(T(x)\big)\Big]^2,$$
where $(Z_i)_{i=1}^n = (Y_i, X_i)_{i=1}^n$ is the training data, $T$ is the tree kernel function for the U-statistic $\hat F$, $U$ is the training set (of size $k$) of a tree $T$, and $D$ is the distribution from which the training data is drawn. So we actually define
$$\hat\sigma^2_F = \frac{k^2}{n^2}\sum_{i=1}^n\Big[E_{U\sim \hat D}\big(T(x)\,\big|\,U_1 = Z_i\big) - E_{U\sim \hat D}\big(T(x)\big)\Big]^2 = \frac{k^2}{n^2}\sum_{i=1}^n (A_i + R_i)^2 \ \text{(say)},$$
where $A_i = E_{U\sim \hat D}\big(\mathring T(x)\,\big|\,U_1 = Z_i\big) - E_{U\sim \hat D}\big(\mathring T(x)\big)$, with $\mathring T = \sum_{i=1}^k T_{1,k}(Z_i)$ being the first Hajek projection of $T$. Then from Wager and Athey, 2017, $E\big[\frac{k^2}{n^2}\sum_{i=1}^n A_i^2\big] = \sigma^2_F$ asymptotically and $\frac{1}{\sigma^2_F}\cdot\frac{k^2}{n^2}\sum_{i=1}^n R_i^2 \xrightarrow{p} 0$.

For the covariance we need to show that $\frac{k}{n^2}\sum_{i=1}^n A_i U_i$ is consistent for $\mathrm{cov}\big(f(\bar y), \hat F(x)\big)$. Now note that by a Taylor expansion $f(\bar y) = f(\mu) + (\bar y - \mu)f'(y^*)$, where $|y^* - \mu| < |\bar y - \mu|$. But $\bar y \to \mu \implies y^* \to \mu \implies f'(y^*) \to f'(\mu)$ by the Strong Law of Large Numbers and by continuity of $f'$. Also, from Wager and Athey, 2017 we know that the first Hajek projection of $\hat F(x)$ is $E[T] + \frac{k}{n}\sum_{i=1}^n T_{1,k}(Z_i)$.
So
$$\mathrm{cov}\big(f(\bar y), \hat F(x)\big) \approx \mathrm{cov}\left(f'(\mu)\Big(\frac{1}{n}\sum_{i=1}^n y_i - \mu\Big),\ E[T] + \frac{k}{n}\sum_{i=1}^n T_{1,k}(Z_i)\right) = f'(\mu)\cdot\mathrm{cov}\left(\frac{1}{n}\sum_{i=1}^n y_i,\ \frac{k}{n}\sum_{i=1}^n T_{1,k}(Z_i)\right)$$
$$= f'(\mu)\cdot n\cdot\frac{1}{n}\cdot\frac{k}{n}\cdot\mathrm{cov}\big(y_n, T_{1,k}(Z_n)\big) = \frac{k}{n}\cdot f'(\mu)\cdot\mathrm{cov}\big(y_n, T_{1,k}(Z_n)\big).$$
As a side note, we observe here that the term $k/n \to 0$ as per the assumptions in Theorem 1, and so we could actually ignore this covariance term (and its estimate) entirely in the final variance estimate. But we will still go ahead and show consistency for the covariance estimate defined above.

Suppose $U_i = B_i + S_i$, where $B_i = f'(\mu)(y_i - \mu)$. We also know from Wager and Athey, 2017 that $A_i = \frac{n-1}{n}\big[T_{1,k}(Z_i) - \frac{1}{n-1}\sum_{j\neq i} T_{1,k}(Z_j)\big]$. Then
$$E[A_i B_i] = \frac{n-1}{n}\cdot E\big[T_{1,k}(Z_i)\cdot f'(\mu)(y_i - \mu)\big] = f'(\mu)\cdot\frac{n-1}{n}\cdot\mathrm{cov}\big(T_{1,k}(Z_i), y_i\big)$$
$$\implies \frac{k}{n^2}\sum_{i=1}^n E[A_i B_i] = \frac{k}{n^2}\cdot n\cdot f'(\mu)\cdot\frac{n-1}{n}\cdot\mathrm{cov}\big(T_{1,k}(Z_n), y_n\big) = \frac{n-1}{n}\cdot\mathrm{cov}\big(f(\bar y), \hat F(x)\big).$$
Also note that $S_i = f'(\bar y)(y_i - \bar y) - f'(\mu)(y_i - \mu) = y_i\big(f'(\bar y) - f'(\mu)\big) - \big(\bar y f'(\bar y) - \mu f'(\mu)\big) = y_i p_n - q_n$ (say). Now by the strong law of large numbers $\bar y \to \mu$ and $\frac{1}{n}\sum_{i=1}^n y_i^2 = s^2 + \bar y^2 \to \sigma^2 + \mu^2$, and also by continuity $p_n, q_n \to 0$. Thus
$$\frac{1}{n}\sum_{i=1}^n S_i^2 = \frac{1}{n}\sum_{i=1}^n \big(y_i^2 p_n^2 + q_n^2 - 2 y_i p_n q_n\big) = p_n^2\cdot\left(\frac{1}{n}\sum_{i=1}^n y_i^2\right) + q_n^2 - 2\bar y\, p_n q_n \to 0.$$
Hence by Cauchy-Schwarz $\frac{k}{n^2}\sum_{i=1}^n E[A_i S_i] \to 0$. Finally, since we assume $k/n \to 0$, we have proved asymptotic consistency.

B Limitations on the Range of Predictions
Our coverage plots demonstrate significant under-coverage at large and small values of the signal. These also correspond to positions in covariate space that are near the edges of the data distribution. While much of the observed coverage attenuation may be explained by edge effects – when most of the data points influencing a prediction are close to the centre of the distribution and hence have signals closer to zero – we also identify a truncation effect associated with the use of Newton residuals that may also affect other boosting frameworks.

The training data for each of the two random forests are of the form $r_i = \frac{\ell'_i(t_i)}{-\ell''_i(t_i)}$. In the Gaussian case this can take any real value, as $\ell'_i$ is linear, $\ell''_i$ is a constant and $t_i$ ranges over $\mathbb{R}$. But in the case of other exponential families this range may be limited. Note that output predictions from a random forest are convex combinations of the input training signals, so if the training signals have a lower (and/or upper) bound then the predictions will also have the same lower (and/or upper) bound. In the binomial case
$$\ell'_i(t_i) = y_i - n_i\cdot\frac{e^{t_i}}{1+e^{t_i}} \implies \ell''_i(t_i) = -n_i\cdot\frac{e^{t_i}}{(1+e^{t_i})^2} \implies r_i = \frac{\ell'_i(t_i)}{-\ell''_i(t_i)} = \frac{y_i - n_i p_i}{n_i p_i(1-p_i)},$$
where $p_i = \frac{e^{t_i}}{1+e^{t_i}} \in [0,1]$. The minimum and maximum values of this are achieved when $y_i$ is 0 and $n_i$ respectively. Consequently
$$\min_i r_i = \min_i \frac{-n_i p_i}{n_i p_i(1-p_i)} = -\frac{1}{1 - \max_i p_i}; \qquad \max_i r_i = \max_i \frac{n_i - n_i p_i}{n_i p_i(1-p_i)} = \frac{1}{\min_i p_i}.$$
Now suppose $p_0 = \frac{e^{\hat\eta^{(0)}_{MLE}}}{1+e^{\hat\eta^{(0)}_{MLE}}}$ is the probability corresponding to the first constant. Then the range of training data for the first (base) random forest $\hat f^{(1)}$ is $\big[-\frac{1}{1-p_0}, \frac{1}{p_0}\big]$. The range of predictions from $\hat\eta^{(0)}_{MLE} + \hat f^{(1)}$ will thus be $\big[\log\big(\frac{p_0}{1-p_0}\big) - \frac{1}{1-p_0},\ \log\big(\frac{p_0}{1-p_0}\big) + \frac{1}{p_0}\big]$. Clearly this interval does not cover the whole real line. This is why in §4.1 there are some true signals for which the confidence interval coverage is 0%: these signals are outside the range of predictions for the forest, and the variance estimates are not high enough to make up for the difference. It is also clearly seen from the above calculation that this theoretical limit on predicted values does not depend on the scale or the number of trials.

Note that the theoretical range of predictions after including the second (boost) random forest will be
$$\left[\log\left(\frac{p_0}{1-p_0}\right) - \frac{1}{1-p_0} - \frac{1}{1-\max_i p_i},\ \log\left(\frac{p_0}{1-p_0}\right) + \frac{1}{p_0} + \frac{1}{\min_i p_i}\right],$$
where $\max_i p_i$ and $\min_i p_i$ have complicated expressions in terms of $p_0$, but it will nevertheless still be a bounded interval. A plot of these ranges as a function of $p_0$ is given in Figure 5 (y-axis in the pseudo-log scale), where we note a very wide range even at its smallest value.

Figure 5: Theoretical range of predictions (link space) for the binomial family, as a function of the MLE constant; max and min shown for the base and boosted stages.
In the Poisson case
$$\ell'_i(t_i) = y_i - e^{t_i} \implies \ell''_i(t_i) = -e^{t_i} \implies r_i = \frac{\ell'_i(t_i)}{-\ell''_i(t_i)} = \frac{y_i}{e^{t_i}} - 1.$$
Since $y_i$ is always non-negative, $r_i \ge -1$. Thus the possible range of predictions from $\hat\eta^{(0)}_{MLE} + \hat f^{(1)}$ is bounded below by $\hat\eta^{(0)}_{MLE} - 1$, and by $\hat\eta^{(0)}_{MLE} - 2$ if we include the second random forest. This is why in §4.1 there are some true signals with value below $\hat\eta^{(0)}_{MLE} - 2$ for which the confidence interval coverage is 0%. Note that $\hat\eta^{(0)}_{MLE}$, and hence the lower-bound threshold, depends on the scale (used for improving the signal-to-noise ratio). There is no upper bound, however, and thus highly positive values of the signal obtain high coverage from the confidence interval.
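This lower bound is easy to verify numerically; a small sketch with an assumed setup (not from the paper):

```r
# The Poisson Newton residuals are bounded below by -1, so base-forest
# predictions (convex combinations of residuals) cannot fall below eta0 - 1.
set.seed(1)
y    <- rpois(1000, lambda = 2)
eta0 <- log(mean(y))
r    <- y / exp(eta0) - 1
min(r) >= -1  # TRUE for any non-negative y
```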
C The effect of outliers in the Poisson case
In our Poisson simulations we observe that the first random forest can sometimes perform worse than the initial constant in terms of response-space MSE even when it improves MSE in the link space, and that this is then ameliorated by the second random forest, suggesting that the effect is not simply due to variance. This behaviour is also more pronounced at larger subsample fractions. Here we provide a heuristic explanation for this observation.

Recall that in the Poisson case the random forest is trained on responses $\big(\frac{y_i}{e^{t_i}} - 1\big)_{i=1}^n$, with the chance of the $i$th data point being selected for a tree being proportional to $e^{t_i}$, where $t_i$ is the current prediction. After the MLE-type constant stage $t_i = \hat\eta^{(0)}_{MLE}$ is constant, i.e., all data points are equally weighted, but we expect a small number of training points to have very high values. These, in turn, will exert significant influence on the structure of the trees for which they are in-bag: splits that isolate these outliers will exhibit the greatest decrease in squared error. Test points near such an outlier will therefore receive very large predictions from each tree for which the outlier was in-bag, something that will increase with bag fraction. This is then exacerbated when the model is exponentiated to be in the response space.

Specifically, suppose we have a very large $y$ in the training data. The initial pseudo-residual $z = y/e^{\hat\eta^{(0)}_{MLE}} - 1$ is also large, so that any tree for which the data point is in-bag will give a prediction $\alpha z + c$, where $\alpha$ accounts for the minimum leaf size and $c$ for the pseudo-responses in the other leaves. A test point that is consistently in the same leaf as our outlier (when it is in-bag) will then give a prediction of $\alpha' z + c'$, where the multiplier $\alpha'$ now also accounts for the in-bag fraction, approximately proportional to the subsample fraction.

By itself, this is not particularly detrimental. The error $\alpha'\big(y/e^{\hat\eta^{(0)}_{MLE}} - 1\big) + c' - f(x)$ may not be terribly large. However, when we exponentiate to get to the response space, the error $e^{\alpha'(y/e^{\hat\eta^{(0)}_{MLE}} - 1) + c'} - e^{f(x)}$ can dominate the remaining errors by virtue of exponentiating an already-large value of $y$. When we fit a second random forest (the boost step), the pseudo-residual corresponding to the outlier is now much smaller, but there are also large negative pseudo-residuals associated with any training points with high in-leaf proximity to the outlier. This then allows the boosted step to lower the high MSE from the previous (base) step.

We demonstrate this situation empirically as follows. Our model is the same as in §4.1 but for only one replicate, one scaling factor (4) and two subsample fractions. For a particular seed of the pseudo-random number generator in R, we obtain the performance statistics given in Table 3.

Table 3: Performance statistics for an example Poisson simulation with poor first-stage performance.

                      Link space MSE      Response space MSE     Log-likelihood
Subsample fraction    0.2      0.8        0.2       0.8          0.2       0.8
Stage 0               2.089    2.089      90.303    90.303       7.745     7.745
Stage 1               1.381    1.205      44.115    454.862      10.163    9.038
Stage 2               0.714    0.505      27.748    174.375      11.206    10.806

The MSE is always lowered as we go through more stages, except for the high subsample fraction in the response space, where the MSE first increases substantially and then decreases somewhat subsequently, but still not enough to beat the performance of just the constant. Notably the log-likelihood, which is the main target of our optimiser, always increases.
For the data corresponding to the table above we've plotted the estimated vs true signals for the test points in Figure 6.

Figure 6: Detecting test points causing high MSE in link or response spaces. Panels: estimate vs signal in the link space and in the response space, for the base and boosted stages; proximity score vs random forest training signal (weighted pseudo-residuals) at sample fractions 0.8 and 0.2, marking an anomalous test point and a random test point.

D Further details of empirical studies
In this section we provide further details of our simulation studies from §4.1. We also discuss how thesimulations could be affected if we had a different signal in the link space.
D.1 More detailed plots
We continue with the same model as discussed in §4.1. First we show all the details missing from Figure 1, namely how the likelihood and MSE behave for different values of $M$. We see that the shapes of the plots stay largely the same, but the range of values may change depending on $M$. Also, in the response scale we show the mean improvement of MSE instead of the median (still in the pseudo-log scale). This then highlights the effect that outliers can have when considering the MSE, as discussed before in appendix C.

We also include the coverage of the 95% confidence intervals vs the signals in the response space (that of the link space can be found in Figure 2). We only provide this for subsample fraction 0.2, but for the others the plots look very similar. We see that the second (boost) random forest improves coverage universally. But the confidence interval has bad coverage for probabilities closer to 0 and 1, although a higher number of trials ($M$) helps the coverage somewhat. In the Poisson case note that the x-axis is in the log scale. The lower coverage for low values of the truth ($\lambda = e^{f(x)}$) may be due to the fact that for low values of $\lambda$ too many $y$'s are zeroes, making it difficult to distinguish between nearby small values of $\lambda$, as well as to the truncation effects discussed in appendix B.

Finally, we show more detailed plots and other metrics we considered for the five fixed points $p_1, \dots, p_5$.

• The absolute bias plots are the same as Figure 3 but with more details, with all the binomial families (all values of $M$) and all 5 points being shown.

• The value of the variance estimate is averaged over 200 replicates. For the response space the y-axis is in the log scale. We see that the variance estimate increases after boosting in most of the cases, and it decreases with subsample fraction. Also, at test points further from the origin the variance decreases for the binomial cases but increases for the Poisson family. Similar behaviour is also observed for the variance estimate vs scale: increasing scale increases the variance for the binomial family but decreases it for the Poisson. Additionally, for the binomial cases the variance estimate decreases with the number of trials (value of $M$).

• The variance consistency figure for the link space is the same as Figure 4 but with more details. We don't observe much of a change in the consistency across different values of $M$. For the response space the consistency is presented in the log scale. Note that for the binomial cases the actual values of the variance and its estimate are small, so fluctuations can result in high ratios. On the other hand we see that the variance is very consistent for the Poisson case.

• The Kolmogorov-Smirnov statistic for normality is always fairly small for the link space, and thus we can conclude that the asymptotic normality mentioned in Theorem 1 probably holds true. For the response space the statistic is generally larger, as might be expected from nonlinear transformations. We also see that the K-S statistic is larger for the boosted stage compared to the base random forest, and also for larger values of the subsample fraction. Its relationship with the number of trials (in the binomial case) and scale (in both families) seems mixed.

• The MSE is averaged over the 200 replicates and shown in the log scale. We see that the MSE for points nearer the origin is higher after the boost stage than after the base stage, although this behaviour is reversed for points farther away from the origin.
• The MSE is averaged over the 200 replicates and shown on the log scale. The MSE at points nearer the origin is higher after the boost stage than after the base stage, although this behaviour is reversed at points farther from the origin. The MSE also increases for the farther points, and mostly increases with the subsample fraction. Higher scales also have higher MSE in general. There does not seem to be any relationship between the MSE and the number of trials in the binomial case.
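As referenced above, here is a minimal sketch of how the per-point diagnostics reported in these plots could be computed from the 200 replicates at one fixed test point. The function name and array layout are illustrative assumptions rather than the code used for the paper; the 1.96 multiplier corresponds to the normal-theory 95% interval.

import numpy as np
from scipy import stats

def point_diagnostics(preds, var_ests, truth):
    # preds, var_ests: length-200 arrays of the link-space prediction and
    # its variance estimate, one entry per replicate; truth: true f(x).
    se = np.sqrt(var_ests)
    # Coverage of the 95% confidence interval: prediction +/- 1.96 * se.
    coverage = np.mean(np.abs(preds - truth) <= 1.96 * se)
    # K-S statistic of the standardised errors against a standard normal,
    # as a check of the asymptotic normality discussed above.
    ks_stat = stats.kstest((preds - truth) / se, "norm").statistic
    # Absolute bias and MSE across replicates.
    abs_bias = abs(np.mean(preds) - truth)
    mse = np.mean((preds - truth) ** 2)
    return coverage, ks_stat, abs_bias, mse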
D.2 Non-linear signal

The model discussed here is exactly the same as in §4.1 except that the signal is f(x) = ‖x‖ − √m instead of f(x) = Σ_{i=1}^m x_i; √m is subtracted so that some of the link-space signals are negative. A sketch of this signal appears below.
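A minimal sketch of this signal; the vectorised form and the (n, m)-shaped input are conveniences for illustration.

import numpy as np

def f_nonlinear(X):
    # Non-linear link-space signal of Appendix D.2: f(x) = ||x|| - sqrt(m).
    # Subtracting sqrt(m) pushes part of the signal below zero.
    return np.linalg.norm(X, axis=1) - np.sqrt(X.shape[1])

# At the origin, for example, f(0) = -sqrt(m), about -2.24 when m = 5.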
The likelihood and MSE always improve for both stages. The improvement is generally larger with scale and slightly larger with the subsample fraction; the number of trials in the binomial case does not seem to have an effect. Most notably, the effect of outliers discussed in §4.1 seems to be non-existent in the Poisson case. This is probably because, for test points far from the origin, the signal (the norm) smooths out, so "outliers" in the sense of far-away points do not actually have extreme signals with which to affect the residuals and/or predictions. Thus adding random forests always seems to improve the MSE in the response space.

Next we look at the coverage of the 95% confidence interval against the true value of the signal (pseudo-log scale). We present plots for only one subsample fraction, as the others look very similar. In the link space, as in the linear-signal case (Figure 2), the boost step improves coverage everywhere and the extreme values have zero coverage; the range of the true signal with good coverage is also wider than in Figure 2. Note that the norm function increases rapidly away from the origin at first, i.e., has a high gradient, but then plateaus farther away. This is why lower values of the true signal are harder to estimate consistently than higher values. Thus negative values of f(x) in the link space have lower confidence coverage than the corresponding positive values (for example −2 vs 2); the same can be observed for probabilities below 0.5 for the binomial cases in the response space. For the Poisson family in the response space, higher values of the true signal have low coverage for a related reason: since higher values of the norm have low gradient, the particular true signal becomes harder to pin down in the confidence interval, and this worsens with higher scales. Finally, the number of trials in the binomial cases (the value of M) seems to refine the confidence coverage somewhat, although this effect is diminished with scale.

We then show plots corresponding to the five fixed points for all the metrics discussed in Appendix D.1.

• Absolute bias decreases for the boosted stage compared to just the base random forest. It also decreases as the points get farther from the origin, although this effect is much weaker in the response space. Bias also decreases as the subsample fraction increases, most pronouncedly for the binomial cases in the response space. The effect of scale is slightly inconclusive: for the binomial cases in the response space the absolute bias increases with scale before subsiding, while in all other cases bias increases with scale. The number of trials in the binomial cases does not seem to have an effect.

• The average variance estimate is higher after the second (boosted) stage than after the first, and it seems to increase with the subsample fraction at first and then decrease, except for the binomial cases in the response space, where the boosted stage can sometimes have lower variance than the base stage and the variance may also consistently decrease with the subsample fraction. The effect of scale is also mixed: for the binomial cases scale increases the variance in the link space but decreases it in the response space, while for the Poisson family we see the reverse, i.e., scale increases the variance in the response space but decreases it in the link space. The variance seems to increase slightly as the points get farther from the origin, and the number of trials (for the binomial families) slightly decreases the variance in the link space but does not seem to have any effect in the response space.

[Figure D.1.1: Improvements in log-likelihood and MSE (link and response spaces) in the pseudo-log scale, by family (binomial with M = 1, 2, 4, 8, 16; Poisson), scale (01, 02, 04, 08, 16), and sample fraction; stage differences Base RF − MLE and Boost − Base RF.]
[Figure D.1.2: Coverage of 95% confidence intervals vs the true signal in the response space (sample fraction 0.2), for the Base RF and Boost RF stages.]
[Figure D.1.3: Absolute bias in the link space at the five fixed points p1 to p5, by sample fraction and scale.]
[Figure D.1.4: Absolute bias in the response space.]
[Figure D.1.5: Average variance estimate in the link space.]
[Figure D.1.6: Average variance estimate in the response space (log scale).]
[Figure D.1.7: Variance consistency in the link space.]
[Figure D.1.8: Variance consistency in the response space (log scale).]
[Figure D.1.9: K-S statistic for normality in the link space.]
[Figure D.1.10: K-S statistic for normality in the response space.]
[Figure D.1.11: Average MSE in the link space (log scale).]
[Figure D.1.12: Average MSE in the response space (log scale).]
[Figure D.2.1: Improvements in log-likelihood and MSE (link and response spaces) in the pseudo-log scale.]
[Figure D.2.2: Coverage of 95% confidence intervals vs the true signal in the link space (one fixed sample fraction).]
[Figure D.2.3: Coverage of 95% confidence intervals vs the true signal in the response space (one fixed sample fraction).]
[Figure D.2.4: Absolute bias in the link space (log scale).]
[Figure D.2.5: Absolute bias in the response space (Poisson family in log scale).]
[Figure D.2.6: Average variance estimate in the link space (log scale).]
[Figure D.2.7: Average variance estimate in the response space (log scale).]
• The variance-consistency plots show that our variance estimate becomes more consistent as the scale and subsample fraction increase, except for the binomial cases in the response space, where both effects appear to be reversed. But, as discussed in Appendix D.1, the quantities being compared here are both very small, so the results could be unstable. The variance estimate also appears more consistent for the boosted stage in the link space and for the base stage in the response space. The consistency metric does not change much as the test points move away from the origin. Finally, increasing the number of trials in the binomial cases makes the variance estimate slightly more consistent in the link space but does not seem to have any effect in the response space. A sketch of this consistency ratio appears after this list.

• The Kolmogorov-Smirnov statistic for normality takes reasonable values; it seems to increase slightly with the subsample fraction and scale. It is higher after the boosted stage, except for the Poisson family in the link space, where the base stage has the higher K-S statistic. The distance of the test point from the origin does not seem to have an effect, and neither does the number of trials in the binomial case. As in Appendix D.1, these statistics are always quite small in the link space, so the asymptotic normality mentioned in Theorem 1 appears reasonable; for the response space the statistic is generally larger, and its relationship with the number of trials (in the binomial case) and the scale (in both families) appears mixed.

• The mean squared error is predictably smaller after the boosted stage than after the base stage, and it also decreases as the points move away from the origin. It increases with scale, except for the binomial cases in the response space, and it decreases slightly with the subsample fraction; the number of trials in the binomial case does not seem to have any effect.
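As referenced in the variance-consistency bullet above, the ratio being plotted can be sketched as follows. Treating consistency as the average variance estimate relative to the empirical variance of the predictions across replicates is our reading of the metric, so take the exact form as an assumption.

import numpy as np

def consistency_ratio(preds, var_ests):
    # preds, var_ests: arrays over replicates at one test point.
    # A ratio near 1 indicates a consistent variance estimate; since both
    # numerator and denominator can be very small, the ratio can fluctuate.
    return np.mean(var_ests) / np.var(preds, ddof=1)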
[Figure D.2.8: Variance consistency in the link space.]
[Figure D.2.9: Variance consistency in the response space (log scale).]
[Figure D.2.10: K-S statistic for normality in the link space.]
[Figure D.2.11: K-S statistic for normality in the response space.]
[Figure D.2.12: Average MSE in the link space (log scale).]
[Figure D.2.13: Average MSE in the response space (log scale).]