Predictive Inference Is Free with the Jackknife+-after-Bootstrap
Byol Kim∗1, Chen Xu†2, and Rina Foygel Barber‡1

1 Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
2 H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

∗ [email protected]  † [email protected]  ‡ [email protected]

November 13, 2020
Abstract
Ensemble learning is widely used in applications to make predictions in complex decision problems—for example, averaging models fitted to a sequence of samples bootstrapped from the available training data. While such methods offer more accurate, stable, and robust predictions and model estimates, much less is known about how to perform valid, assumption-lean inference on the output of these types of procedures. In this paper, we propose the jackknife+-after-bootstrap (J+aB), a procedure for constructing a predictive interval, which uses only the available bootstrapped samples and their corresponding fitted models, and is therefore "free" in terms of the cost of model fitting. The J+aB offers a predictive coverage guarantee that holds with no assumptions on the distribution of the data, the nature of the fitted model, or the way in which the ensemble of models is aggregated—at worst, the failure rate of the predictive interval is inflated by a factor of 2. Our numerical experiments verify the coverage and accuracy of the resulting predictive intervals on real data.
1 Introduction

Ensemble learning is a popular technique for enhancing the performance of machine learning algorithms. It is used to capture a complex model space with simple hypotheses which are often significantly easier to learn, or to increase the accuracy of an otherwise unstable procedure [see Hastie et al., 2009, Polikar, 2006, Rokach, 2010, and references therein].

While ensembling can provide substantially more stable and accurate estimates, relatively little is known about how to perform provably valid inference on the resulting output. Particular challenges arise when the data distribution is unknown, or when the base learner is difficult to analyze. To consider a motivating example, suppose that each observation consists of a vector of features $X \in \mathbb{R}^p$ and a real-valued response $Y \in \mathbb{R}$. Even in an idealized scenario where we might be certain that the data follow a linear model, it is still not clear how we might perform inference on a bagged prediction obtained by, say, averaging the Lasso predictions on multiple bootstrapped samples of the data.

To address the problem of valid statistical inference for ensemble predictions, we propose a method for constructing a predictive confidence interval for a new observation that can be wrapped around existing ensemble prediction methods. Our method integrates ensemble learning with the recently proposed jackknife+ [Barber et al., 2019]. It is implemented by tweaking how the ensemble aggregates the learned predictions. This makes the resulting integrated algorithm output an interval-valued prediction that, when run at a target predictive coverage level of $1-\alpha$, provably covers the new response value at least a $1-2\alpha$ proportion of the time in the worst case, with no assumptions on the data beyond independent and identically distributed samples.

Our main contributions are as follows.

• We propose the jackknife+-after-bootstrap (J+aB), a method for constructing predictive confidence intervals that can be efficiently wrapped around an ensemble learning algorithm chosen by the user.
• We prove that the coverage of a J+aB interval is at worst $1-2\alpha$ in the assumption-free theory. This lower bound is non-asymptotic, and holds for any sample size and any distribution of the data.
• We verify that the empirical coverage of a J+aB interval is actually close to $1-\alpha$.

2 Background

Suppose we are given $n$ independent and identically distributed (i.i.d.) observations $(X_1,Y_1),\dots,(X_n,Y_n) \overset{\text{iid}}{\sim} P$ from some probability distribution $P$ on $\mathbb{R}^p \times \mathbb{R}$. Given the available training data, we would like to predict the value of the response $Y_{n+1}$ for a new data point with features $X_{n+1}$, where we assume that $(X_{n+1},Y_{n+1})$ is drawn from the same probability distribution $P$. A common framework is to fit a regression model $\hat\mu : \mathbb{R}^p \to \mathbb{R}$ by applying some regression algorithm to the training data $\{(X_i,Y_i)\}_{i=1}^n$, and then predicting $\hat\mu(X_{n+1})$ as our best estimate of the unseen test response $Y_{n+1}$.

However, the question arises: how can we quantify the likely accuracy or error level of these predictions? For example, can we use the available information to build an interval around our estimate, $\hat\mu(X_{n+1}) \pm$ (some margin of error), that we believe is likely to contain $Y_{n+1}$? More generally, we want to build a predictive interval $\hat C(X_{n+1}) \subseteq \mathbb{R}$ that maps the test features $X_{n+1}$ to an interval (or, more generally, a set) believed to contain $Y_{n+1}$.
Thus, instead of $\hat\mu : \mathbb{R}^p \to \mathbb{R}$, we would like our method to output $\hat C$ mapping $\mathbb{R}^p$ to subsets of $\mathbb{R}$, with the property

  $\mathbb{P}\big[\,Y_{n+1} \in \hat C(X_{n+1})\,\big] \geq 1 - \alpha$,   (1)

where the probability is with respect to the distribution of the $n+1$ training and test data points (as well as any additional source of randomness used in obtaining $\hat C$). Ideally, we want $\hat C$ to satisfy (1) for any data distribution $P$. Such a $\hat C$ is said to satisfy distribution-free predictive coverage at level $1-\alpha$.

2.1 The jackknife and the jackknife+

One of the methods that can output $\hat C$ with distribution-free predictive coverage is the recent jackknife+ of Barber et al. [2019], which inspired our work. As suggested by the name, the jackknife+ is a simple modification of the jackknife approach to constructing predictive confidence intervals.

To define the jackknife and the jackknife+, we begin by introducing some notation. Let $R$ denote any regression algorithm; $R$ takes in a training data set, and outputs a model $\hat\mu : \mathbb{R}^p \to \mathbb{R}$, which can then be used to map a new $X$ to a predicted $Y$. We will write $\hat\mu = R(\{(X_i,Y_i)\}_{i=1}^n)$ for the model fitted on the full training data, and $\hat\mu_{\setminus i} = R(\{(X_j,Y_j)\}_{j=1,\dots,n,\ j \neq i})$ for the model fitted on the training data without the $i$-th point. Let $q^+_{\alpha,n}\{v_i\}$ and $q^-_{\alpha,n}\{v_i\}$ denote the upper and lower $\alpha$-quantiles of a list of $n$ values indexed by $i$; that is,

  $q^+_{\alpha,n}\{v_i\} = $ the $\lceil (1-\alpha)(n+1) \rceil$-th smallest value of $v_1,\dots,v_n$, and $q^-_{\alpha,n}\{v_i\} = -q^+_{\alpha,n}\{-v_i\}$.

The jackknife prediction interval is given by

  $\hat C^{\mathrm{J}}_{\alpha,n}(x) = \hat\mu(x) \pm q^+_{\alpha,n}\{R_i\} = \big[\,q^-_{\alpha,n}\{\hat\mu(x) - R_i\},\ q^+_{\alpha,n}\{\hat\mu(x) + R_i\}\,\big]$,   (2)

where $R_i = |Y_i - \hat\mu_{\setminus i}(X_i)|$ is the $i$-th leave-one-out residual. This is based on the idea that the $R_i$'s are good estimates of the test residual $|Y_{n+1} - \hat\mu_{\setminus i}(X_{n+1})|$, because the data used to train $\hat\mu_{\setminus i}$ is independent of $(X_i,Y_i)$. Perhaps surprisingly, it turns out that fully assumption-free theory is impossible for (2) [see Barber et al., 2019, Theorem 2]. By contrast, it is achieved by the jackknife+, which modifies (2) by replacing $\hat\mu$ with the $\hat\mu_{\setminus i}$'s:

  $\hat C^{\mathrm{J+}}_{\alpha,n}(x) = \big[\,q^-_{\alpha,n}\{\hat\mu_{\setminus i}(x) - R_i\},\ q^+_{\alpha,n}\{\hat\mu_{\setminus i}(x) + R_i\}\,\big]$.   (3)

Barber et al. [2019] showed that (3) satisfies distribution-free predictive coverage at level $1-2\alpha$. Intuitively, the reason that such a guarantee is impossible for (2) is that the test residual $|Y_{n+1} - \hat\mu(X_{n+1})|$ is not quite comparable with the leave-one-out residuals $|Y_i - \hat\mu_{\setminus i}(X_i)|$, because $\hat\mu$ always sees one more observation in training than $\hat\mu_{\setminus i}$ sees. The jackknife+ correction restores the symmetry, making assumption-free theory possible.
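To make the construction concrete, the following is a minimal Python sketch of the jackknife+ interval (3). This is our illustration rather than the authors' code, and it assumes a generic scikit-learn-style regressor with fit/predict (Ridge here is merely a placeholder choice).

```python
import numpy as np
from sklearn.linear_model import Ridge

def q_plus(values, alpha):
    """q^+_{alpha,n}: the ceil((1-alpha)(n+1))-th smallest of n values (+inf if out of range)."""
    v = np.sort(np.asarray(values))
    n = len(v)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    return np.inf if k > n else v[k - 1]

def jackknife_plus_interval(X, y, x_test, alpha=0.1, make_model=Ridge):
    """Jackknife+ interval (3): requires n leave-one-out refits, one per training point."""
    n = len(y)
    lo_vals, hi_vals = [], []
    for i in range(n):
        mask = np.arange(n) != i                       # leave out point i
        mu_i = make_model().fit(X[mask], y[mask])      # leave-one-out model
        r_i = abs(y[i] - mu_i.predict(X[i:i + 1])[0])  # leave-one-out residual R_i
        pred = mu_i.predict(x_test.reshape(1, -1))[0]
        lo_vals.append(pred - r_i)
        hi_vals.append(pred + r_i)
    # q^-_{alpha,n}{v_i} = -q^+_{alpha,n}{-v_i}
    return -q_plus(-np.asarray(lo_vals), alpha), q_plus(np.asarray(hi_vals), alpha)
```

Note the $n$ refits; the point of the J+aB developed below is to avoid exactly this cost when the predictor is itself an ensemble.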
2.2 Ensemble methods

In this paper, we are concerned with ensemble predictions that apply a base regression method $R$, such as linear regression or the Lasso, to different training sets generated from the training data by a resampling procedure. Specifically, the ensemble method starts by creating multiple training data sets (or multisets) of size $m$ from the available training data points $\{1,\dots,n\}$. We may choose the sets by bootstrapping (sampling $m$ indices uniformly at random with replacement—a typical choice is $m = n$), or by subsampling (sampling without replacement, for instance with $m = n/2$). For each $b$, the algorithm calls on $R$ to fit the model $\hat\mu_b$ using the training set $S_b$, and then aggregates the $B$ predictions $\hat\mu_1(x),\dots,\hat\mu_B(x)$ into a single final prediction $\hat\mu_\varphi(x)$ via an aggregation function $\varphi$, typically chosen to be a simple function such as the median, mean, or trimmed mean. (Formally, we define $\varphi$ as a map from $\bigcup_{k \geq 0} \mathbb{R}^k \to \mathbb{R}$, mapping any collection of predictions in $\mathbb{R}$ to a single aggregated prediction; if the collection is empty, we would simply output zero or some other default choice. $\varphi$ lifts naturally to a map on vectors of functions, by writing $\hat\mu_\varphi = \varphi(\hat\mu_1,\dots,\hat\mu_B)$, where $\hat\mu_\varphi(x)$ is defined for each $x \in \mathbb{R}^p$ by applying $\varphi$ to the collection $(\hat\mu_1(x),\dots,\hat\mu_B(x))$.) When $\varphi$ is the mean, the ensemble method run with bootstrapped $S_b$'s is referred to as bagging [Breiman, 1996], while if we instead use subsampled $S_b$'s, then this ensembling procedure is referred to as subagging [Bühlmann and Yu, 2002]. The procedure is formalized in Algorithm 1.
Algorithm 1 Ensemble learning
Input: Data $\{(X_i,Y_i)\}_{i=1}^n$
Output: Ensembled regression function $\hat\mu_\varphi$
for $b = 1,\dots,B$ do
  Draw $S_b = (i_{b,1},\dots,i_{b,m})$ by sampling with or without replacement from $\{1,\dots,n\}$.
  Compute $\hat\mu_b = R((X_{i_{b,1}},Y_{i_{b,1}}),\dots,(X_{i_{b,m}},Y_{i_{b,m}}))$.
end for
Define $\hat\mu_\varphi = \varphi(\hat\mu_1,\dots,\hat\mu_B)$.
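For reference, a minimal sketch of Algorithm 1 in the same style as the earlier snippet (ours, not the paper's released code), again assuming a scikit-learn-style base regressor:

```python
import numpy as np

def fit_ensemble(X, y, make_model, B=20, m=None, replace=True, rng=None):
    """Algorithm 1, fitting stage: B models trained on resampled index sets S_1, ..., S_B."""
    rng = np.random.default_rng(rng)
    n = len(y)
    m = n if m is None else m
    S, models = [], []
    for _ in range(B):
        s = rng.choice(n, size=m, replace=replace)   # S_b, with or without replacement
        S.append(s)
        models.append(make_model().fit(X[s], y[s]))  # mu_hat_b = R applied to S_b
    return S, models

def predict_ensemble(models, x, phi=np.mean):
    """mu_hat_phi(x) = phi(mu_hat_1(x), ..., mu_hat_B(x))."""
    return phi([mdl.predict(x.reshape(1, -1))[0] for mdl in models])
```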
Can we apply the jackknife+ to an ensemble method? While ensembling is generally understood to provide a more robust and stable prediction as compared to the underlying base algorithm, there are substantial difficulties in developing inference procedures for ensemble methods with theoretical guarantees. For one thing, ensemble methods are frequently used with highly discontinuous and nonlinear base learners, and aggregating many of them leads to models that defy an easy analysis. The problem is compounded by the fact that ensemble methods are typically employed in settings where good generative models of the data distribution are either unavailable or difficult to obtain. This makes distribution-free methods that can wrap around arbitrary machine learning algorithms—such as conformal prediction [Vovk et al., 2005, Lei et al., 2018], the split conformal method [Papadopoulos, 2008, Vovk, 2013, Lei et al., 2018], or cross-validation and jackknife type methods [Barber et al., 2019]—attractive, as they retain validity over any data distribution. However, when deployed with ensemble prediction methods, which often require a significant overhead from the extra cost of model fitting, the resulting combined procedures tend to be extremely computationally intensive, making them impractical in most settings. In the case of the jackknife+, if each ensembled model makes $B$ many calls to the base regression method $R$, the jackknife+ would require a total of $Bn$ calls to $R$ (for example, $B = 20$ and $n = 1000$ already amounts to $20{,}000$ model fits). By contrast, our method will require only $O(B)$ many calls to $R$, making the computational burden comparable to obtaining a single ensemble prediction.

2.3 Related work

Our paper contributes to the fast-expanding literature on distribution-free predictive inference, which has garnered attention in recent years for providing valid inferential tools that can work with complex machine learning algorithms such as neural networks. This is because many of the proposed methods are "wrapper" algorithms that can be used in conjunction with an arbitrary learning procedure. This list includes the conformal prediction methodology of Vovk et al. [2005], Lei et al. [2018], the split conformal methods explored in Papadopoulos [2008], Vovk [2013], Lei et al. [2018], and the jackknife+ of Barber et al. [2019].
Meanwhile, methods such as cross-validation or leave-one-out cross-validation (also called the "jackknife") stabilize the results in practice but require some assumptions to analyze theoretically [Steinberger and Leeb, 2016, 2018, Barber et al., 2019].

The method we propose can also be viewed as a wrapper designed specifically for use with ensemble learners. As mentioned in Section 2.2, applying a distribution-free wrapper around an ensemble prediction method typically results in a combined procedure that is computationally burdensome. This has motivated many authors to come up with cost-efficient wrappers for use in the ensemble prediction setting. For example, Papadopoulos et al. [2002], Papadopoulos and Haralambous [2011] use a holdout set to assess the predictive accuracy of an ensembled model. However, when the sample size $n$ is limited, one may achieve more accurate predictions with a cross-validation or jackknife type method, as such a method avoids reducing the sample size in order to obtain a holdout set. Moreover, by using "out-of-bag" estimates [Breiman, 1997], it is often possible to reduce the overall cost to the extent that it is on par with obtaining a single ensemble prediction. This is explored in Johansson et al. [2014], who propose a prediction interval of the form $\hat\mu_\varphi(X_{n+1}) \pm q^+_{\alpha,n}\{R_i\}$, where $\hat\mu_{\varphi\setminus i} = \varphi(\{\hat\mu_b : b = 1,\dots,B,\ S_b \not\ni i\})$ and $R_i = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$. Zhang et al. [2019] provide a theoretical analysis of this type of prediction interval, ensuring that predictive coverage holds asymptotically under additional assumptions. Devetyarov and Nouretdinov [2010], Löfström et al. [2013], Boström et al. [2017b,a], Linusson et al. [2019] study variants of this type of method, but fully distribution-free coverage cannot be guaranteed for these methods. By contrast, our method preserves exchangeability, and hence is able to maintain assumption-free and finite-sample validity.

More recently, Kuchibhotla and Ramdas [2019] looked at aggregating conformal inference after subsampling or bootstrapping. Their work proposes ensembling multiple runs of an inference procedure, while in contrast our present work seeks to provide inference for ensembled methods.

Stepping away from distribution-free methods, for the popular random forests [Ho, 1995, Breiman, 2001], Meinshausen [2006], Athey et al. [2019], Lu and Hardin [2019] proposed methods for estimating conditional quantiles, which can be used to construct prediction intervals. The guarantees they provide are necessarily approximate or asymptotic, and rely on additional conditions. Tangentially related are the methods for estimating the variance of the random forest estimator of the conditional mean, e.g., Sexton and Laake [2009], Wager et al. [2014], Mentch and Hooker [2016], which apply, in order, the jackknife-after-bootstrap (not jackknife+) [Efron, 1992], the infinitesimal jackknife [Efron, 2014], and U-statistics theory. Roy and Larocque [2019] propose a heuristic for constructing prediction intervals using such variance estimates. For a comprehensive survey of statistical work related to random forests, we refer the reader to the literature review by Athey et al. [2019].
3 Jackknife+-after-bootstrap (J+aB)

We present our method, the jackknife+-after-bootstrap (J+aB). To design a cost-efficient wrapper method suited to the ensemble prediction setting, we borrow an old insight from Breiman [1997] and make use of "out-of-bag" estimates. Specifically, it is possible to obtain the $i$-th leave-one-out model $\hat\mu_{\varphi\setminus i}$ without additional calls to the base regression method, by reusing the already computed $\hat\mu_1,\dots,\hat\mu_B$ and aggregating only those $\hat\mu_b$'s whose underlying training data set $S_b$ did not include the $i$-th data point. This is formalized in Algorithm 2.

Because the J+aB algorithm recycles the same $B$ models $\hat\mu_1,\dots,\hat\mu_B$ to compute all $n$ leave-one-out models $\hat\mu_{\varphi\setminus i}$, the cost of model fitting is identical for the J+aB algorithm and the ensemble learning.
Algorithm 2 Jackknife+-after-bootstrap (J+aB)
Input: Data $\{(X_i,Y_i)\}_{i=1}^n$
Output: Predictive interval $\hat C^{\mathrm{J+aB}}_{\alpha,n,B}$
for $b = 1,\dots,B$ do
  Draw $S_b = (i_{b,1},\dots,i_{b,m})$ by sampling with or without replacement from $\{1,\dots,n\}$.
  Compute $\hat\mu_b = R((X_{i_{b,1}},Y_{i_{b,1}}),\dots,(X_{i_{b,m}},Y_{i_{b,m}}))$.
end for
for $i = 1,\dots,n$ do
  Aggregate $\hat\mu_{\varphi\setminus i} = \varphi(\{\hat\mu_b : b = 1,\dots,B,\ S_b \not\ni i\})$.
  Compute the residual $R_i = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$.
end for
Compute the J+aB prediction interval: at each $x \in \mathbb{R}^p$,
  $\hat C^{\mathrm{J+aB}}_{\alpha,n,B}(x) = \big[\,q^-_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(x) - R_i\},\ q^+_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(x) + R_i\}\,\big]$.

Table 1 compares the computational costs of an ensemble method, the jackknife+ wrapped around an ensemble, and the J+aB when the goal is to make $n_{\text{test}}$ predictions.

Table 1: Computational costs of making $n_{\text{test}}$ test predictions.

                       calls to $R$    model evaluations            calls to $\varphi$
  Ensemble             $B$             $B \, n_{\text{test}}$       $n_{\text{test}}$
  J+ with Ensemble     $Bn$            $Bn(1 + n_{\text{test}})$    $n(1 + n_{\text{test}})$
  J+aB                 $B$             $B(n + n_{\text{test}})$     $n(1 + n_{\text{test}})$

In settings where both model evaluations and aggregations remain relatively cheap, our J+aB algorithm is able to output a more informative confidence interval at virtually no extra cost beyond what is already necessary to produce a single ensemble point prediction. Thus, one can view the J+aB as offering predictive inference "free of charge."
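In code, the only change relative to the plain ensemble is the out-of-bag bookkeeping. Here is a minimal sketch (again ours, not the authors' released implementation), reusing the hypothetical fit_ensemble and q_plus helpers from the sketches above:

```python
import numpy as np

def jab_interval(X, y, x_test, make_model, alpha=0.1, B=20, phi=np.mean, rng=None):
    """Algorithm 2 sketch: J+aB interval from a single ensemble fit (B calls to R)."""
    n = len(y)
    S, models = fit_ensemble(X, y, make_model, B=B, rng=rng)  # from the Algorithm 1 sketch
    lo_vals, hi_vals = [], []
    for i in range(n):
        oob = [mdl for s, mdl in zip(S, models) if i not in s]  # models whose S_b excludes i
        # phi over an empty collection defaults to 0, per the paper's footnote on phi
        def mu_i(x, ms=oob):
            return phi([m.predict(x.reshape(1, -1))[0] for m in ms]) if ms else 0.0
        r_i = abs(y[i] - mu_i(X[i]))   # out-of-bag residual R_i
        pred = mu_i(x_test)
        lo_vals.append(pred - r_i)
        hi_vals.append(pred + r_i)
    # q^-_{alpha,n}{v} = -q^+_{alpha,n}{-v}
    return -q_plus(-np.asarray(lo_vals), alpha), q_plus(np.asarray(hi_vals), alpha)
```

Note that the loop performs no model fitting: the same $B$ fitted models are filtered and re-aggregated for each $i$, which is exactly why the J+aB costs no more model fits than the ensemble itself.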
4 Theoretical guarantees

In this section, we prove that the coverage of a J+aB interval satisfies a distribution-free lower bound of $1-2\alpha$ in the worst case. We make two assumptions, one on the data distribution and the other on the ensemble algorithm.

Assumption 1. $(X_1,Y_1),\dots,(X_n,Y_n),(X_{n+1},Y_{n+1}) \overset{\text{iid}}{\sim} P$, where $P$ is any distribution on $\mathbb{R}^p \times \mathbb{R}$.

Assumption 2. For every $k \geq 1$, any fixed $k$-tuple $((x_1,y_1),\dots,(x_k,y_k)) \in (\mathbb{R}^p \times \mathbb{R})^k$, and any permutation $\sigma$ on $\{1,\dots,k\}$, it holds that $R((x_1,y_1),\dots,(x_k,y_k)) = R((x_{\sigma(1)},y_{\sigma(1)}),\dots,(x_{\sigma(k)},y_{\sigma(k)}))$ and $\varphi(y_1,\dots,y_k) = \varphi(y_{\sigma(1)},\dots,y_{\sigma(k)})$. In other words, the base regression algorithm $R$ and the aggregation $\varphi$ are both invariant to the ordering of their input arguments.

Assumption 1 is fairly standard in the distribution-free prediction literature [Vovk et al., 2005, Lei et al., 2018, Barber et al., 2019]. In fact, our results only require exchangeability of the $n+1$ data points, as is typical in distribution-free inference—the i.i.d. assumption is a familiar special case. Assumption 2 is a natural condition in the setting where the data points are i.i.d., and therefore should logically be treated symmetrically.

Theorem 1 gives the distribution-free coverage guarantee for the J+aB prediction interval, with one intriguing twist: the total number of base models, $B$, must be drawn at random rather than chosen in advance. This is because Algorithm 2 as given subtly violates symmetry even when $R$ and $\varphi$ are themselves symmetric. However, we shall see that requiring $B$ to be Binomial is enough to restore symmetry, after which assumption-free theory is possible.
Theorem 1. Fix any integers $\tilde B \geq 1$ and $m \geq 1$, any base algorithm $R$, and any aggregation function $\varphi$. Suppose the jackknife+-after-bootstrap method (Algorithm 2) is run with

(i) $B \sim \mathrm{Binomial}(\tilde B, (1 - \tfrac{1}{n+1})^m)$ in the case of sampling with replacement, or
(ii) $B \sim \mathrm{Binomial}(\tilde B, 1 - \tfrac{m}{n+1})$ in the case of sampling without replacement.

Then, under Assumptions 1 and 2, the jackknife+-after-bootstrap prediction interval satisfies

  $\mathbb{P}\big[\,Y_{n+1} \in \hat C^{\mathrm{J+aB}}_{\alpha,n,B}(X_{n+1})\,\big] \geq 1 - 2\alpha$,

where the probability holds with respect to the random draw of the training data $(X_1,Y_1),\dots,(X_n,Y_n)$, the test data point $(X_{n+1},Y_{n+1})$, and the Binomial $B$.

(If $R$ and/or $\varphi$ involve any randomization—for example, if $\varphi$ operates by sampling from the collection of predictions—then we can require that the outputs are equal in distribution under any permutation of the input arguments, rather than requiring that equality holds deterministically. In this case, the coverage guarantees in our theorems hold on average over the randomization in $R$ and/or $\varphi$, in addition to the distribution of the data.)

Proof sketch. Our proof follows the main ideas of the jackknife+ guarantee [Barber et al., 2019, Theorem 1]. It is a consequence of the jackknife+ construction that the guarantee can be obtained by a simultaneous comparison of all $n$ pairs of leave-one-out(-of-$n$) residuals, $|Y_{n+1} - \hat\mu_{\setminus i}(X_{n+1})|$ vs. $|Y_i - \hat\mu_{\setminus i}(X_i)|$ for $i = 1,\dots,n$. The key insight provided by Barber et al. [2019] is that this is easily done by regarding the residuals as leave-two-out(-of-$(n+1)$) residuals $|Y_i - \tilde\mu_{\setminus i,j}(X_i)|$ with $\{i,j\} \ni n+1$, where $\tilde\mu_{\setminus i,j}$ is a model trained on the augmented data combining both training and test points and then screening out the $i$-th and the $j$-th points, one of which is the test point. These leave-two-out residuals are naturally embedded in an $(n+1)\times(n+1)$ array of all the leave-two-out residuals, $R = [\,R_{ij} = |Y_i - \tilde\mu_{\setminus i,j}(X_i)| : i \neq j \in \{1,\dots,n+1\}\,]$. Since the $n+1$ data points in the augmented data are i.i.d., they are exchangeable, and hence so is the array $R$; i.e., the distribution of $R$ is invariant to relabeling of the indices. A simple counting argument then ensures that the jackknife+ interval fails to cover with probability at most $2\alpha$. This is the essence of the jackknife+ proof.

Turning to our J+aB, it may be tempting to define $\tilde\mu_{\varphi\setminus i,j} = \varphi(\{\hat\mu_b : S_b \not\ni i,j\})$, the aggregation of all $\hat\mu_b$'s whose underlying data set $S_b$ excludes $i$ and $j$, and go through with the jackknife+ proof. However, this construction is no longer useful; the corresponding $R$ in this case is no longer exchangeable. This is most easily seen by noting that there are always exactly $B$ many $S_b$'s that do not include the test observation $n+1$, whereas the number of $S_b$'s that do not contain a particular training observation $i \in \{1,\dots,n\}$ is a random number usually smaller than $B$. The issue here is that the J+aB algorithm as given fails to be symmetric in all $n+1$ data points. However, just as the jackknife+ symmetrized the jackknife by replacing $\hat\mu$ with the $\hat\mu_{\setminus i}$'s, the J+aB can also be symmetrized by merely requiring it to run with a Binomial $B$. To see why, consider the "lifted" Algorithm 3.
Algorithm 3 Lifted J+aB residuals
Input: Data $\{(X_i,Y_i)\}_{i=1}^{n+1}$
Output: Residuals $(R_{ij} : i \neq j \in \{1,\dots,n+1\})$
for $b = 1,\dots,\tilde B$ do
  Draw $\tilde S_b = (i_{b,1},\dots,i_{b,m})$ by sampling with or without replacement from $\{1,\dots,n+1\}$.
  Compute $\tilde\mu_b = R((X_{i_{b,1}},Y_{i_{b,1}}),\dots,(X_{i_{b,m}},Y_{i_{b,m}}))$.
end for
for pairs $i \neq j \in \{1,\dots,n+1\}$ do
  Aggregate $\tilde\mu_{\varphi\setminus i,j} = \varphi(\{\tilde\mu_b : \tilde S_b \not\ni i,j\})$.
  Compute the residual $R_{ij} = |Y_i - \tilde\mu_{\varphi\setminus i,j}(X_i)|$.
end for
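For completeness, a small sketch of this lifted construction (our illustration; it is a proof device rather than something one would run in practice), computing the array $R_{ij}$ by brute force over the $n+1$ augmented points:

```python
import numpy as np

def lifted_residuals(X_aug, y_aug, make_model, B_tilde, m=None, phi=np.mean, rng=None):
    """Algorithm 3 sketch: leave-two-out residual array R_ij over the n+1 augmented points."""
    rng = np.random.default_rng(rng)
    N = len(y_aug)                           # N = n + 1 augmented data points
    m = N if m is None else m
    S = [rng.choice(N, size=m, replace=True) for _ in range(B_tilde)]   # S_tilde_b
    models = [make_model().fit(X_aug[s], y_aug[s]) for s in S]
    R = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            oob = [mdl for s, mdl in zip(S, models) if i not in s and j not in s]
            # phi over an empty collection defaults to 0, per the footnote on phi
            mu_ij = phi([mdl.predict(X_aug[i:i + 1])[0] for mdl in oob]) if oob else 0.0
            R[i, j] = abs(y_aug[i] - mu_ij)  # R_ij = |Y_i - mu_tilde_{phi \ i,j}(X_i)|
    return R
```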
Because all $n+1$ data points are treated equally by Algorithm 3, the resulting array of residuals $R = [R_{ij} : i \neq j \in \{1,\dots,n+1\}]$ is again exchangeable. Now, for each $i = 1,\dots,n+1$, define $\tilde E_i$ as the event that $\sum_{j \in \{1,\dots,n+1\}\setminus\{i\}} \mathbf{1}\{R_{ij} > R_{ji}\} \geq (1-\alpha)(n+1)$. Because of the exchangeability of the array, the same counting argument mentioned above ensures $\mathbb{P}[\tilde E_{n+1}] \leq 2\alpha$.

To relate the event $\tilde E_{n+1}$ to the actual J+aB interval $\hat C^{\mathrm{J+aB}}_{\alpha,n,B}(X_{n+1})$ being constructed, we need to couple Algorithms 2 and 3. Let $B = \sum_{b=1}^{\tilde B} \mathbf{1}\{\tilde S_b \not\ni n+1\}$, the number of $\tilde S_b$'s containing only the training data in the lifted construction, and let $1 \leq b_1 < \dots < b_B \leq \tilde B$ be the indices of such $\tilde S_b$'s. Note that $B$ is Binomially distributed, as required by the theorem. For each $k = 1,\dots,B$, define $S_k = \tilde S_{b_k}$. Then each $S_k$ is an independent uniform draw from $\{1,\dots,n\}$, with or without replacement. Therefore, we can equivalently consider running Algorithm 2 with these particular $S_1,\dots,S_B$. Furthermore, this ensures that $\tilde\mu_{\varphi\setminus n+1,i} = \hat\mu_{\varphi\setminus i}$ for each $i$; that is, the leave-one-out models in Algorithm 2 coincide with the leave-two-out models in Algorithm 3. Thus, we have constructed a coupling of the J+aB with its lifted version.

Finally, define $E_{n+1}$ as the event that $\sum_{i=1}^n \mathbf{1}\{|Y_{n+1} - \hat\mu_{\varphi\setminus i}(X_{n+1})| > R_i\} \geq (1-\alpha)(n+1)$, where $R_i = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$ as before. By the coupling we have constructed, we can see that the event $E_{n+1}$ is equivalent to the lifted event $\tilde E_{n+1}$, and thus $\mathbb{P}[E_{n+1}] = \mathbb{P}[\tilde E_{n+1}] \leq 2\alpha$. It can be verified that in the event that the J+aB interval fails to cover, i.e., if $Y_{n+1} \notin \hat C^{\mathrm{J+aB}}_{\alpha,n,B}(X_{n+1})$, the event $E_{n+1}$ must occur, which concludes the proof. The full version of this proof is given in Appendix A.

In most settings where a large number of models are being aggregated, we would not expect the distinction of random vs. fixed $B$ to make a meaningful difference to the final output. In Appendix B, we formalize this intuition and give a stability condition on the aggregating map $\varphi$ under which the J+aB has valid coverage for any choice of $B$.

Finally, we remark that although we have exclusively used the regression residuals $|Y_i - \hat\mu_{\setminus i}(X_i)|$ in our exposition for concreteness, our method can also accommodate alternative measures of conformity, e.g., using quantile regression as in Romano et al. [2019] or weighted residuals as in Lei et al. [2018], which can better handle heteroscedasticity. More generally, if $\hat c_{\varphi\setminus i}$ is the trained conformity measure aggregated from the $S_b$'s that did not use the $i$-th point, then the corresponding J+aB set is given by

  $\hat C^{c\text{-J+aB}}_{\alpha,n,B}(x) = \Big\{\, y : \sum_{i=1}^n \mathbf{1}\{\hat c_{\varphi\setminus i}(x,y) > \hat c_{\varphi\setminus i}(X_i,Y_i)\} < (1-\alpha)(n+1) \,\Big\}$.
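To illustrate this generalized construction, here is a short sketch (ours, with a hypothetical score_loo callable standing in for the aggregated conformity measure $\hat c_{\varphi\setminus i}$) that evaluates the set above over a grid of candidate responses:

```python
import numpy as np

def jab_set_from_scores(score_loo, X, y, x_test, y_grid, alpha=0.1):
    r"""Generalized J+aB set evaluated on a grid of candidate y values.

    score_loo(i, x, y) is assumed to return the aggregated out-of-bag conformity
    score c_{phi \ i}(x, y); choosing |y - mu_{phi \ i}(x)| recovers the interval J+aB.
    """
    n = len(y)
    train_scores = [score_loo(i, X[i], y[i]) for i in range(n)]
    kept = [y_cand for y_cand in y_grid
            if sum(score_loo(i, x_test, y_cand) > train_scores[i] for i in range(n))
               < (1 - alpha) * (n + 1)]
    return np.array(kept)
```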
5 Experiments

In this section, we demonstrate numerically that the J+aB intervals enjoy coverage near the nominal level of $1-\alpha$, using three real data sets and different ensemble prediction methods. In addition, we also look at the results for the jackknife+, combined either with the same ensemble method (J+ensemble) or with the non-ensembled base method (J+non-ensemble); the precise definitions are given in Appendix D.1. The code is available online.

We used three real data sets, which were also used in Barber et al. [2019], following the same data preprocessing steps as described therein. The Communities and Crime (Communities) data set [Redmond and Baveja, 2002] contains information on 1994 communities with $p = 99$ covariates. The response $Y$ is the per capita violent crime rate. The BlogFeedback (Blog) data set [Buza, 2014] contains information on 52397 blog posts with $p = 280$ covariates. The response is the number of comments left on the blog post in the following 24 hours, which we transformed as $Y = \log(1 + \text{comments})$. The Medical Expenditure Panel Survey (MEPS) 2016 data set from the Agency for Healthcare Research and Quality, with details for older versions in Ezzati-Rice et al. [2008], contains information on 33005 individuals with $p = 107$ covariates. The response is a score measuring each individual's utilization level of medical services. We transformed this as $Y = \log(1 + \text{utilization score})$.

For the base regression method $R$, we used either ridge regression (Ridge), the random forest (RF), or a neural network (NN). For Ridge, we set the penalty at $\lambda = 0.001\|X\|^2$, where $\|X\|$ is the spectral norm of the training data matrix. RF was implemented using the RandomForestRegressor method from scikit-learn with 20 trees grown for each random forest using the mean absolute error criterion and the bootstrap option turned off, with default settings otherwise. For NN, we used the MLPRegressor method from scikit-learn with the L-BFGS solver and the logistic activation function, with default settings otherwise. For the aggregation $\varphi$, we used averaging (Mean). Results obtained with other aggregation methods are discussed in Appendix D.2.

We fixed $\alpha = 0.1$ for a target coverage of 90%. We used $n = 40$ observations for training, sampling uniformly without replacement to create a training–test split for each trial. The results presented here are from 10 independent training–test splits of each data set. The ensemble wrappers J+aB and J+ensemble used sampling with replacement. We varied the size $m$ of each bootstrap replicate as $m/n = 0.2, 0.4, \dots, 1.0$. For J+ensemble, we used $B = 20$. For the J+aB, we drew $B \sim \mathrm{Binomial}(\tilde B, (1 - \tfrac{1}{n+1})^m)$ with $\tilde B = \big[\,20 / \{(1 - \tfrac{1}{n+1})^m (1 - \tfrac{1}{n})^m\}\,\big]$, where $[\cdot]$ denotes the integer part of the argument. This ensures that the number of models being aggregated for each leave-one-out model is matched on average to the number in J+ensemble. We remark that the scale of our experiments, as reflected in the number of different training–test splits or the size of $n$ or $B$, has been limited by the computationally inefficient J+ensemble.

We emphasize that we made no attempt to optimize any of our models. This is because our goal here is to illustrate certain properties of our method that hold universally for any data distribution and any ensemble method, and not just in cases when the method happens to be the "right" one for the data. All other things being equal, the statistical efficiency of the intervals our method constructs would be most impacted by how accurately the model is able to capture the data. However, because the method we propose leaves this choice up to the users, performance comparisons along the axis of different ensemble methods are arguably not very meaningful.
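For concreteness, the draw of $B$ just described can be computed as in the following sketch (our code, not the released experiment script):

```python
import numpy as np

def draw_B(n, m, target=20, rng=None):
    """B ~ Binomial(B_tilde, (1 - 1/(n+1))^m), with B_tilde set so that the expected
    number of models aggregated per leave-one-out model matches `target`."""
    rng = np.random.default_rng(rng)
    p_lift = (1 - 1 / (n + 1)) ** m          # P(a lifted bootstrap sample avoids the test point)
    p_oob = (1 - 1 / n) ** m                 # P(a bootstrap sample avoids a fixed training point)
    B_tilde = int(target / (p_lift * p_oob)) # integer part, as in the text
    return rng.binomial(B_tilde, p_lift)

# e.g., n = 40 and m = 40 give p_lift ~ 0.372 and p_oob ~ 0.363, so B_tilde = 147
```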
We are rather more interested in comparisons of the J+aB and J+ensemble, and of the J+aB (or J+ensemble) and J+non-ensemble. For the J+aB vs. J+ensemble comparison, we are on the lookout for potential systematic tradeoffs between computational and statistical efficiency. For each $i$, conditional on the event that the same number of models were aggregated for the $i$-th leave-one-out models $\hat\mu_{\varphi\setminus i}$ in the J+aB and J+ensemble, the two $\hat\mu_{\varphi\setminus i}$'s have the same marginal distribution. However, this is not the case for the joint distribution of all $n$ leave-one-out models $\{\hat\mu_{\varphi\setminus i}\}_{i=1}^n$: with respect to the resampling measure, the collection is highly correlated in the case of the J+aB, and independent in the case of J+ensemble. Thus, in principle, the statistical properties of $\hat C^{\mathrm{J+aB}}_{\alpha,n,B}$ and $\hat C^{\mathrm{J+ensemble}}_{\alpha,n,B'}$ could differ, although it would be a surprise if it were to turn out that one method always performed better than the other. In comparing the J+aB (or J+ensemble) and J+non-ensemble, we seek to reaffirm some known results in bagging. It is well-known that bagging improves the accuracy of unstable predictors, but has little effect on stable ones [Breiman, 1996, Bühlmann and Yu, 2002].
J+ensemble or J+non-ensemble . Second,in Figure 2, we observe no evidence of a consistent trend of one method always outperforming the other interms of the precision of the intervals, although we do see some slight variations across different data setsand regression algorithms. Thus, we prefer the computationally efficient J+aB to the costly
J+ensemble .Finally, comparing the J+aB (or
J+ensemble ) and
J+non-ensemble , we find the effect of bagging reflectedin the interval widths, and we see improved precision in the case of RF , and for some data sets and atsome values of m , in the case of NN . Thus, in settings where the base learner is expected to benefit fromensembling, the J+aB is a practical method for obtaining informative prediction intervals that requires alevel of computational resources on par with the ensemble algorithm itself.9 Conclusion
6 Conclusion

We propose the jackknife+-after-bootstrap (J+aB), a computationally efficient wrapper method tailored to the setting of ensemble learning, where, by a simple modification to the aggregation stage, the method outputs a predictive interval with a fully assumption-free coverage guarantee in place of a point prediction. The J+aB provides a mechanism for quantifying uncertainty in ensemble predictions that is both straightforward to implement and easy to interpret, and it can therefore be easily integrated into existing ensemble models.
Acknowledgments
R.F.B. was supported by the National Science Foundation via grant DMS-1654076. The authors are grateful to Aaditya Ramdas and Chirag Gupta for helpful suggestions on related work.
References
Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. Ann. Statist., 47(2):1148–1178, 2019. doi: 10.1214/18-AOS1709. URL https://projecteuclid.org:443/euclid.aos/1547197251.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+, 2019. arXiv preprint.

H. Boström, L. Asker, R. Gurung, I. Karlsson, T. Lindgren, and P. Papapetrou. Conformal prediction using random survival forests. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 812–817, 2017a. doi: 10.1109/ICMLA.2017.00-57.

Henrik Boström, Henrik Linusson, Tuve Löfström, and Ulf Johansson. Accelerating difficulty estimation for conformal regression forests. Annals of Mathematics and Artificial Intelligence, 81(1):125–144, 2017b. doi: 10.1007/s10472-017-9539-9. URL https://doi.org/10.1007/s10472-017-9539-9.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 9780199535255. doi: 10.1093/acprof:oso/9780199535255.001.0001.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug 1996. doi: 10.1007/BF00058655. URL https://doi.org/10.1007/BF00058655.

Leo Breiman. Out-of-bag estimation. Technical report, Department of Statistics, University of California, Berkeley, 1997.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001. ISSN 1573-0565. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.

Peter Bühlmann and Bin Yu. Analyzing bagging. Ann. Statist., 30(4):927–961, Aug 2002. doi: 10.1214/aos/1031689014. URL https://doi.org/10.1214/aos/1031689014.

Andreas Buja and Werner Stuetzle. Observations on bagging. Statist. Sinica, 16(2):323–351, 2006. ISSN 1017-0405.

Krisztian Buza. Feedback prediction for blogs. In Myra Spiliopoulou, Lars Schmidt-Thieme, and Ruth Janning, editors, Data Analysis, Machine Learning and Knowledge Discovery, pages 145–152. Springer International Publishing, 2014. ISBN 978-3-319-01595-8.

Dmitry Devetyarov and Ilia Nouretdinov. Prediction with confidence based on a random forest classifier. In Harris Papadopoulos, Andreas S. Andreou, and Max Bramer, editors, Artificial Intelligence Applications and Innovations, pages 37–44, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. ISBN 978-3-642-16239-8.

Bradley Efron. Jackknife-after-bootstrap standard errors and influence functions. Journal of the Royal Statistical Society, Series B (Methodological), 54(1):83–127, 1992. ISSN 00359246.

Bradley Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007, 2014. doi: 10.1080/01621459.2013.823775. URL https://doi.org/10.1080/01621459.2013.823775. PMID: 25346558.

Trena M. Ezzati-Rice, Frederick Rohde, and Janet Greenblatt. Sample design of the medical expenditure panel survey household component, 1998–2007. Methodology Report 22, Agency for Healthcare Research and Quality, Rockville, MD, Mar 2008.

Jerome H. Friedman and Peter Hall. On bagging and nonlinear estimation. Journal of Statistical Planning and Inference, 137(3):669–683, 2007. ISSN 0378-3758. doi: 10.1016/j.jspi.2006.06.002. Special Issue on Nonparametric Statistics and Related Topics: In honor of M.L. Puri.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Ensemble Learning, pages 605–624. Springer New York, 2009. ISBN 978-0-387-84858-7. doi: 10.1007/978-0-387-84858-7_16. URL https://doi.org/10.1007/978-0-387-84858-7_16.

Tin Kam Ho. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, Aug 1995. doi: 10.1109/ICDAR.1995.598994.

Ulf Johansson, Henrik Boström, Tuve Löfström, and Henrik Linusson. Regression conformal prediction with random forests. Machine Learning, 97(1):155–176, 2014. doi: 10.1007/s10994-014-5453-0. URL https://doi.org/10.1007/s10994-014-5453-0.

Arun K. Kuchibhotla and Aaditya K. Ramdas. Nested conformal prediction and the generalized jackknife+, 2019. arXiv preprint.

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018. doi: 10.1080/01621459.2017.1307116. URL https://doi.org/10.1080/01621459.2017.1307116.

H. Linusson, U. Johansson, and H. Boström. Efficient conformal predictor ensembles. Neurocomputing, 2019. doi: 10.1016/j.neucom.2019.07.113.

T. Löfström, U. Johansson, and H. Boström. Effective utilization of data in inductive conformal prediction using ensembles of neural networks. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2013. ISSN 2161-4393. doi: 10.1109/IJCNN.2013.6706817.

Benjamin Lu and Johanna Hardin. A unified framework for random forest prediction error estimation, 2019. arXiv preprint.

N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.

Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research, 17(26):1–41, 2016. URL http://jmlr.org/papers/v17/14-168.html.

Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. In Paula Fritzsche, editor, Tools in Artificial Intelligence, pages 325–330. InTech, 2008. ISBN 978-953-7619-03-9.

Harris Papadopoulos and Haris Haralambous. Reliable prediction intervals with regression neural networks. Neural Networks, 24(8):842–851, 2011. doi: 10.1016/j.neunet.2011.05.008.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen, editors, Machine Learning: ECML 2002, pages 345–356, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-36755-0.

Robi Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45, Mar 2006. ISSN 1558-0830. doi: 10.1109/MCAS.2006.1688199.

Michael Redmond and Alok Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002. ISSN 0377-2217. doi: 10.1016/S0377-2217(01)00264-8.

Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1):1–39, Feb 2010. ISSN 1573-7462. doi: 10.1007/s10462-009-9124-7. URL https://doi.org/10.1007/s10462-009-9124-7.

Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3543–3553. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8613-conformalized-quantile-regression.pdf.

Marie-Hélène Roy and Denis Larocque. Prediction intervals with random forests. Statistical Methods in Medical Research, 29(1):205–229, 2019. doi: 10.1177/0962280219829885. URL https://doi.org/10.1177/0962280219829885.

Joseph Sexton and Petter Laake. Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis, 53(3):801–811, 2009. ISSN 0167-9473. doi: 10.1016/j.csda.2008.08.007. Computational Statistics within Clinical Research.

Lukas Steinberger and Hannes Leeb. Leave-one-out prediction intervals in linear regression models with many variables, 2016. arXiv preprint.

Lukas Steinberger and Hannes Leeb. Conditional predictive inference for high-dimensional stable algorithms, 2018. arXiv preprint.

Vladimir Vovk. Conditional validity of inductive conformal predictors. Machine Learning, 92(2–3):349–376, 2013. ISSN 08856125. doi: 10.1007/s10994-013-5355-6.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, 2005. ISBN 978-0-387-25061-8. doi: 10.1007/b106715. URL https://doi.org/10.1007/b106715.

Stefan Wager, Trevor Hastie, and Bradley Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15:1625–1651, 2014. URL http://jmlr.org/papers/v15/wager14a.html.

Haozhe Zhang, Joshua Zimmerman, Dan Nettleton, and Daniel J. Nordman. Random forest prediction intervals. The American Statistician, pages 1–15, 2019. doi: 10.1080/00031305.2019.1585288. URL https://doi.org/10.1080/00031305.2019.1585288.
A Proof of Theorem 1
For completeness, we give the full details of the proof of Theorem 1; a sketch of the proof is presented in Section 4 of the main paper.

Denote Algorithm 3 by $\tilde{\mathcal{A}}$. We view $\tilde{\mathcal{A}}$ as mapping a given input $\{(X_i,Y_i)\}_{i=1}^{n+1}$ and a collection of subsamples or bootstrapped samples $\tilde S_1,\dots,\tilde S_{\tilde B}$ to a matrix of residuals $R \in \mathbb{R}^{(n+1)\times(n+1)}$, where

  $R_{ij} = |Y_i - \tilde\mu_{\varphi\setminus i,j}(X_i)|$ if $i \neq j$, and $R_{ij} = 0$ if $i = j$.

For any permutation $\sigma$ on $\{1,\dots,n+1\}$, let $\Pi_\sigma$ stand for its matrix representation—that is, $\Pi_\sigma \in \{0,1\}^{(n+1)\times(n+1)}$ has entries $(\Pi_\sigma)_{\sigma(i),i} = 1$ for each $i$, and zeros elsewhere. Furthermore, for each subsample or bootstrapped sample $\tilde S_b = \{i_{b,1},\dots,i_{b,m}\}$, write $\sigma(\tilde S_b) = \{\sigma(i_{b,1}),\dots,\sigma(i_{b,m})\}$.

We now claim that

  $R \overset{d}{=} \Pi_\sigma R \Pi_\sigma^\top$,   (S1)

for any fixed permutation $\sigma$ on $\{1,\dots,n+1\}$. Here $R$ is the residual matrix obtained by a run of Algorithm 3, namely, $R = \tilde{\mathcal{A}}\big((X_1,Y_1),\dots,(X_{n+1},Y_{n+1});\ \tilde S_1,\dots,\tilde S_{\tilde B}\big)$. To see why (S1) holds, observe that deterministically, we have

  $\Pi_\sigma R \Pi_\sigma^\top = \tilde{\mathcal{A}}\big((X_{\sigma(1)},Y_{\sigma(1)}),\dots,(X_{\sigma(n+1)},Y_{\sigma(n+1)});\ \sigma(\tilde S_1),\dots,\sigma(\tilde S_{\tilde B})\big)$.

Furthermore, we have $\big((X_1,Y_1),\dots,(X_{n+1},Y_{n+1})\big) \overset{d}{=} \big((X_{\sigma(1)},Y_{\sigma(1)}),\dots,(X_{\sigma(n+1)},Y_{\sigma(n+1)})\big)$ by Assumption 1, and $\big(\tilde S_1,\dots,\tilde S_{\tilde B}\big) \overset{d}{=} \big(\sigma(\tilde S_1),\dots,\sigma(\tilde S_{\tilde B})\big)$ since subsampling or resampling treats all the indices the same. Finally, the subsamples or bootstrapped samples (i.e., the $\tilde S_b$'s) are drawn independently of the data points (i.e., the $(X_i,Y_i)$'s). Combining these calculations yields (S1).

Next, given $R$, define a "tournament matrix" $A = A(R)$ as
  $A_{ij} = \mathbf{1}\{R_{ij} > R_{ji}\}$ if $i \neq j$, and $A_{ij} = 0$ if $i = j$.

It is easily checked that $A(\Pi_\sigma R \Pi_\sigma^\top) = \Pi_\sigma A(R) \Pi_\sigma^\top$, and hence (S1) implies that

  $A \overset{d}{=} \Pi_\sigma A \Pi_\sigma^\top$.   (S2)

Let $S_\alpha(A)$ be the set of row indices with row sums greater than or equal to $(1-\alpha)(n+1)$, i.e.,

  $S_\alpha(A) = \Big\{\, i = 1,\dots,n+1 : \sum_{j=1}^{n+1} A_{ij} \geq (1-\alpha)(n+1) \,\Big\}$.

The argument of Step 3 in the proof of Barber et al. [2019, Theorem 1] applies to the lifted J+aB "tournament matrix" $A$, and it holds deterministically that

  $|S_\alpha(A)| \leq 2\alpha(n+1)$.   (S3)

On the other hand, if $j$ is any index, and $\sigma$ is any permutation that swaps indices $n+1$ and $j$, then $\mathbb{P}[n+1 \in S_\alpha(A)] = \mathbb{P}[j \in S_\alpha(\Pi_\sigma A \Pi_\sigma^\top)] = \mathbb{P}[j \in S_\alpha(A)]$. Therefore,

  $\mathbb{P}[n+1 \in S_\alpha(A)] = \frac{1}{n+1}\sum_{j=1}^{n+1}\mathbb{P}[j \in S_\alpha(A)] = \frac{1}{n+1}\,\mathbb{E}\Big[\sum_{j=1}^{n+1}\mathbf{1}\{j \in S_\alpha(A)\}\Big] = \frac{\mathbb{E}|S_\alpha(A)|}{n+1} \leq 2\alpha$.   (S4)

Note that the event $\{n+1 \in S_\alpha(A)\}$ is exactly the event $\tilde E_{n+1}$ defined in Section 4. As described in the proof sketch in Section 4 of the main paper, we can couple this lifted event to the event $E_{n+1}$, also defined in Section 4 in terms of the actual J+aB, as follows. Let $B = \sum_{b=1}^{\tilde B}\mathbf{1}\{\tilde S_b \not\ni n+1\}$, the number of $\tilde S_b$'s containing only training data, and let $1 \leq b_1 < \dots < b_B \leq \tilde B$ be the corresponding indices. Note that the distribution of $B$ is Binomial, as specified in the theorem. Now, for each $k = 1,\dots,B$, define $S_k = \tilde S_{b_k}$. We can observe that each $S_k$ is an independent uniform draw from $\{1,\dots,n\}$ (with or without replacement). Therefore, we can equivalently consider running the J+aB (Algorithm 2) with these particular subsamples or bootstrapped samples $S_1,\dots,S_B$, in which case it holds deterministically that $\tilde\mu_{\varphi\setminus n+1,i} = \hat\mu_{\varphi\setminus i}$ for each $i = 1,\dots,n$. This ensures that $|Y_{n+1} - \tilde\mu_{\varphi\setminus n+1,i}(X_{n+1})| = |Y_{n+1} - \hat\mu_{\varphi\setminus i}(X_{n+1})|$ and $|Y_i - \tilde\mu_{\varphi\setminus i,n+1}(X_i)| = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$, and thus

  $\mathbb{P}[E_{n+1}] = \mathbb{P}[\tilde E_{n+1}] \leq 2\alpha$.

Finally, as in Step 1 in the proof of Barber et al. [2019, Theorem 1], it easily follows from the definition of $\hat C^{\mathrm{J+aB}}_{\alpha,n,B}$ that if $Y_{n+1} \notin \hat C^{\mathrm{J+aB}}_{\alpha,n,B}(X_{n+1})$ then the event $E_{n+1}$ must occur. Indeed, if $Y_{n+1} \notin \hat C^{\mathrm{J+aB}}_{\alpha,n,B}(X_{n+1})$, then either $Y_{n+1}$ falls below the lower bound, i.e.,

  $\sum_{i=1}^n \mathbf{1}\big\{\, Y_{n+1} - \hat\mu_{\varphi\setminus i}(X_{n+1}) < -|Y_i - \hat\mu_{\varphi\setminus i}(X_i)| \,\big\} \geq (1-\alpha)(n+1)$,

or $Y_{n+1}$ exceeds the upper bound, i.e.,

  $\sum_{i=1}^n \mathbf{1}\big\{\, Y_{n+1} - \hat\mu_{\varphi\setminus i}(X_{n+1}) > |Y_i - \hat\mu_{\varphi\setminus i}(X_i)| \,\big\} \geq (1-\alpha)(n+1)$,

and either of the two displays implies

  $\sum_{i=1}^n \mathbf{1}\big\{\, |Y_{n+1} - \hat\mu_{\varphi\setminus i}(X_{n+1})| > |Y_i - \hat\mu_{\varphi\setminus i}(X_i)| \,\big\} \geq (1-\alpha)(n+1)$,

which is the event $E_{n+1}$. Therefore, we conclude that $\mathbb{P}\big[\,Y_{n+1} \notin \hat C^{\mathrm{J+aB}}_{\alpha,n,B}(X_{n+1})\,\big] \leq 2\alpha$, thus proving the theorem.
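As a quick numerical illustration of the deterministic bound (S3)—our sanity check, not part of the paper—one can verify that for any comparison-based tournament matrix, at most a $2\alpha$ fraction of the $n+1$ rows can have row sums of at least $(1-\alpha)(n+1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 40, 0.1
for _ in range(1000):
    R = rng.standard_normal((n + 1, n + 1))   # an arbitrary residual array
    A = (R > R.T).astype(int)                 # tournament matrix: A_ij = 1[R_ij > R_ji]
    n_strange = int(np.sum(A.sum(axis=1) >= (1 - alpha) * (n + 1)))
    assert n_strange <= 2 * alpha * (n + 1)   # the deterministic bound (S3)
```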
B Guarantees with stability

Many ensembles that are used in practice are variants of bagging, where multiple independent copies of the given training data set are generated through a resampling mechanism, after which estimates from the different data sets are pooled together via an averaging procedure of some kind. Bagging can be understood as a smoothing operation that, when applied to a discontinuous base learner, often greatly improves its accuracy [Bühlmann and Yu, 2002, Buja and Stuetzle, 2006, Friedman and Hall, 2007].

For ensembles of this type, the aggregated predictions they produce frequently exhibit a concentrating behavior as $B \to \infty$, making the corresponding J+aB interval behave much like a jackknife+ interval. In such cases, it is reasonable to expect a J+aB interval to remain valid regardless of the choice of $B$ (e.g., random with a Binomial distribution, or fixed), by its proximity to a jackknife+ interval. Intuitively, this happens when the aggregation is insensitive to any one prediction participating in the ensemble.

To formalize this, let $\mathbb{E}^*$ denote the expectation with respect to the resampling measure—that is, we take the expectation with respect to the random collection of subsamples or bootstrapped samples $S_1,\dots,S_B$, conditional on all the observed data $\{(X_i,Y_i)\}_{i=1}^n$ and $X_{n+1}$. For example, when $\varphi(\cdot) = \mathrm{Mean}(\cdot)$ is the mean aggregation,

  $\mathbb{E}^*[\hat\mu_{\mathrm{Mean}}(X_{n+1})] = \mathbb{E}\big[\,\hat\mu_1(X_{n+1}) \mid (X_1,Y_1),\dots,(X_n,Y_n),X_{n+1}\,\big]$,

the expected prediction from the model $\hat\mu_1$ fitted on the training sample $S_1$, where the expectation is taken with respect to the draw of $S_1$.

Assumption S1 (Ensemble stability). For $\varepsilon \geq 0$ and $\delta \in (0,1)$, it holds for each $i = 1,\dots,n$ that

  $\mathbb{P}\big[\,|\hat\mu_{\varphi\setminus i}(X_i) - \mathbb{E}^*[\hat\mu_{\varphi\setminus i}(X_i)]| > \varepsilon\,\big] \leq \delta$.

Here $\hat\mu_{\varphi\setminus i}$ is the ensembled leave-one-out model defined in Algorithm 2. To gain intuition for this assumption, we consider the mean aggregation as a canonical example, and verify that it satisfies Assumption S1 for any bounded base regression method.
Proposition S1. Suppose that $\varphi(\cdot) = \mathrm{Mean}(\cdot)$ is the mean aggregation, and suppose the base regression method $R$ always outputs a bounded regression function, i.e., $R$ maps any training data set to a function $\hat\mu$ taking values in a bounded range $[\ell, u]$, for fixed constants $\ell < u$. Then, for any $\varepsilon > 0$, Assumption S1 is satisfied with

  $\delta = 2\exp\!\Big(-\frac{2\sqrt{B}\,\theta\varepsilon^2}{(u-\ell)^2}\Big) + \exp\!\Big(-\frac{(\sqrt{B}-1)^2\theta}{2}\Big)$,

where $\theta = (1 - \tfrac{1}{n})^m$ in the case of bagging (i.e., the $S_b$'s are bootstrapped samples, drawn with replacement), or $\theta = 1 - \tfrac{m}{n}$ in the case of subagging (i.e., the $S_b$'s are subsamples, drawn without replacement).

Proof. By exchangeability, it suffices to prove the statement for a single $i \in \{1,\dots,n\}$. Fix $i$, and let $B_i$ denote the number of $S_b$'s not containing the index $i$, i.e., $B_i = \sum_{b=1}^B \mathbf{1}\{S_b \not\ni i\}$. For any fixed $\gamma \in (0,1)$,

  $\mathbb{P}^*\big[\,|\hat\mu_{\mathrm{Mean}\setminus i}(X_i) - \mathbb{E}^*[\hat\mu_{\mathrm{Mean}\setminus i}(X_i)]| > \varepsilon\,\big] \leq \mathbb{P}^*\big[\,|\hat\mu_{\mathrm{Mean}\setminus i}(X_i) - \mathbb{E}^*[\hat\mu_{\mathrm{Mean}\setminus i}(X_i)]| > \varepsilon \text{ and } B_i \geq \gamma\theta B\,\big] + \mathbb{P}^*\big[\,B_i < \gamma\theta B\,\big]$.

As for our earlier notation $\mathbb{E}^*$, here $\mathbb{P}^*$ denotes the probability with respect to the random collection of subsamples or bootstrapped samples $S_1,\dots,S_B$, conditional on the data $(X_1,Y_1),\dots,(X_n,Y_n)$.

The arithmetic mean aggregation function $\varphi_{\mathrm{Mean}}$ satisfies the bounded-differences property

  $\sup_{y_1,\dots,y_{B_i},\,y_b' \in [\ell,u]} |\varphi_{\mathrm{Mean}}(y_1,\dots,y_{B_i}) - \varphi_{\mathrm{Mean}}(y_1,\dots,y_{b-1},y_b',y_{b+1},\dots,y_{B_i})| \leq \frac{u-\ell}{B_i}, \quad b = 1,\dots,B_i$.

Thus, by McDiarmid's inequality [Boucheron et al., 2013, Theorem 6.2],

  $\mathbb{P}^*\big[\,|\hat\mu_{\mathrm{Mean}\setminus i}(X_i) - \mathbb{E}^*[\hat\mu_{\mathrm{Mean}\setminus i}(X_i)]| > \varepsilon \,\big|\, B_i \geq \gamma\theta B\,\big] \leq 2\exp\!\Big(-\frac{2B\gamma\theta\varepsilon^2}{(u-\ell)^2}\Big)$.   (S5)

On the other hand, $B_i \sim \mathrm{Binomial}(B,\theta)$, where
$\theta = (1 - \tfrac{1}{n})^m$ for sampling with replacement, or $\theta = 1 - \tfrac{m}{n}$ for sampling without replacement. The Chernoff inequality for the binomial [Boucheron et al., 2013, Chapter 2] implies

  $\mathbb{P}\big[\,B_i < \gamma\theta B\,\big] \leq \exp\!\Big(-\frac{B(1-\gamma)^2\theta}{2}\Big)$.   (S6)

Combining (S5) and (S6),

  $\mathbb{P}^*\big[\,|\hat\mu_{\mathrm{Mean}\setminus i}(X_i) - \mathbb{E}^*[\hat\mu_{\mathrm{Mean}\setminus i}(X_i)]| > \varepsilon\,\big] \leq 2\exp\!\Big(-\frac{2B\gamma\theta\varepsilon^2}{(u-\ell)^2}\Big) + \exp\!\Big(-\frac{B(1-\gamma)^2\theta}{2}\Big)$.

Taking $\gamma = 1/\sqrt{B}$ yields

  $\mathbb{P}^*\big[\,|\hat\mu_{\mathrm{Mean}\setminus i}(X_i) - \mathbb{E}^*[\hat\mu_{\mathrm{Mean}\setminus i}(X_i)]| > \varepsilon\,\big] \leq 2\exp\!\Big(-\frac{2\sqrt{B}\,\theta\varepsilon^2}{(u-\ell)^2}\Big) + \exp\!\Big(-\frac{(\sqrt{B}-1)^2\theta}{2}\Big)$,

which completes the proof.

To study coverage properties under this notion of stability, we first define the $\varepsilon$-inflated J+aB prediction interval as

  $\hat C^{\varepsilon\text{-J+aB}}_{\alpha,n,B}(x) = \big[\,q^-_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(x) - R_i\} - \varepsilon,\ q^+_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(x) + R_i\} + \varepsilon\,\big]$,

for any $\varepsilon \geq 0$. We then have the following guarantee:
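To get a feel for Proposition S1, here is a short computation of this $\delta$ as a function of $B$ (our illustration, not from the paper), for bagging with $m = n$ and predictions bounded in $[0,1]$; the bound decays with $\sqrt{B}$ in the exponent:

```python
import numpy as np

def ensemble_stability_delta(B, n, m, eps, lo, hi):
    """delta from Proposition S1 for mean aggregation with outputs in [lo, hi]."""
    theta = (1 - 1 / n) ** m                  # bagging case (sampling with replacement)
    mcdiarmid = 2 * np.exp(-2 * np.sqrt(B) * theta * eps**2 / (hi - lo) ** 2)
    chernoff = np.exp(-((np.sqrt(B) - 1) ** 2) * theta / 2)
    return mcdiarmid + chernoff

for B in (20, 100, 1000, 10000):
    print(B, ensemble_stability_delta(B, n=40, m=40, eps=0.25, lo=0.0, hi=1.0))
```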
Theorem S1. Under $(\varepsilon,\delta)$-ensemble stability (Assumption S1), the $2\varepsilon$-inflated jackknife+-after-bootstrap prediction interval satisfies

  $\mathbb{P}\big[\,Y_{n+1} \in \hat C^{2\varepsilon\text{-J+aB}}_{\alpha,n,B}(X_{n+1})\,\big] \geq 1 - 2\alpha - 4\sqrt{\delta}$.

Delaying the proof to the end of this section, we discuss the difference between Theorem S1 and Theorem 1. Theorem 1 gives an assumption-free lower bound of $1-2\alpha$ on the coverage, but the probability is over all randomness, including that of the Binomial draw. By contrast, the $\approx 1-2\alpha$ coverage guarantee of Theorem S1 holds for a fixed value of $B$ used to run Algorithm 2, but at the cost of requiring the ensemble algorithm $R_{\varphi,B}$ to satisfy ensemble stability.

In contrast to the above notion of ensemble stability, Steinberger and Leeb [2018] and Barber et al. [2019] study the coverage of the jackknife and jackknife+ under algorithmic stability of the (non-ensembled) regression method $R$. This requires $R$ to satisfy

  $\mathbb{P}\big[\,|\hat\mu_{\setminus i}(X_{n+1}) - \hat\mu(X_{n+1})| > \varepsilon^*\,\big] \leq \delta^*$.   (S7)

This can be interpreted as saying that a prediction $\hat\mu(X_{n+1})$ is only slightly perturbed if a single point is removed from the training set. In this setting, the jackknife and jackknife+ are each shown to guarantee $\approx 1-\alpha$ coverage.

We can take a lifted version of this assumption, requiring that (S7) holds for the ensembled models on average over the resampling process:

  $\mathbb{P}\big[\,|\mathbb{E}^*[\hat\mu_{\varphi\setminus i}(X_{n+1})] - \mathbb{E}^*[\hat\mu_{\varphi}(X_{n+1})]| > \varepsilon^*\,\big] \leq \delta^*$.   (S8)

Note that one can have ensemble stability without algorithmic stability. For example, a bounded regression method may still be highly unstable relative to adding or removing a single data point (thus violating algorithmic stability), while Proposition S1 ensures that ensemble stability will hold under mean aggregation.

When an ensemble method satisfies both Assumption S1 and the lifted version of algorithmic stability (S8), then the following result yields a coverage bound that is $\approx 1-\alpha$, rather than $\approx 1-2\alpha$ as in Theorem S1:
Theorem S2. Assume that $(\varepsilon,\delta)$-ensemble stability (Assumption S1) holds, and in addition, the ensembled model satisfies algorithmic stability on average over the resampling process, i.e., (S8). Then the $(2\varepsilon + 2\varepsilon^*)$-inflated J+aB prediction interval satisfies

  $\mathbb{P}\big[\,Y_{n+1} \in \hat C^{(2\varepsilon+2\varepsilon^*)\text{-J+aB}}_{\alpha,n,B}(X_{n+1})\,\big] \geq 1 - \alpha - 3\sqrt{\delta} - 4\sqrt{\delta^*}$.

Proof of Theorems S1 and S2.
Put $\hat\mu^*_{\varphi\setminus i} = \mathbb{E}^*[\hat\mu_{\varphi\setminus i}]$, where we recall that $\mathbb{E}^*$ is the expectation conditional on the data. Let $R^*_\varphi$ denote the regression algorithm mapping data to $\hat\mu^*_\varphi$, i.e.,

  $R^*_\varphi : \{(X_i,Y_i)\}_{i=1}^n \mapsto \mathbb{E}^*\Big[\varphi\big(\big\{\,R(\{(X_{i_{b,\ell}},Y_{i_{b,\ell}})\}_{\ell=1}^m) : b = 1,\dots,B'\,\big\}\big)\Big], \quad B' \sim \mathrm{Binomial}(B,\theta)$,
where $\theta = \theta(n) = (1 - \tfrac{1}{n+1})^m$ (in the case of sampling with replacement) or $\theta = \theta(n) = 1 - \tfrac{m}{n+1}$ (in the case of sampling without replacement). We emphasize that $n$ here refers to the size of the sample being fed through $R^*_\varphi$; e.g., each leave-one-out regressor $\hat\mu^*_{\varphi\setminus i}$ is trained on $n-1$ data points, so in this case $\theta = \theta(n-1)$.

$R^*_\varphi$ is a deterministic function of the data, since it averages over the random draw of the subsamples or bootstrapped samples. Furthermore, it is a symmetric regression algorithm (i.e., it satisfies Assumption 2). Fix some $\alpha' \in (0,1)$ to be determined later, and construct the jackknife+ interval

  $\hat C^{*\mathrm{J+}}_{\alpha',n}(x) = \big[\,q^-_{\alpha',n}\{\hat\mu^*_{\varphi\setminus i}(x) - R^*_i\},\ q^+_{\alpha',n}\{\hat\mu^*_{\varphi\setminus i}(x) + R^*_i\}\,\big]$,

where $R^*_i = |Y_i - \hat\mu^*_{\varphi\setminus i}(X_i)|$ is the leave-one-out residual for this new regression algorithm. By Barber et al. [2019, Theorem 1], $\hat C^{*\mathrm{J+}}_{\alpha',n}$ satisfies

  $\mathbb{P}\big[\,Y_{n+1} \in \hat C^{*\mathrm{J+}}_{\alpha',n}(X_{n+1})\,\big] \geq 1 - 2\alpha'$.

If, additionally, $R^*_\varphi$ satisfies the algorithmic stability condition (S7), then by Barber et al. [2019, Theorem 5], the $\varepsilon^*$-inflated jackknife+ interval

  $\hat C^{*\varepsilon^*\text{-J+}}_{\alpha',n}(x) = \big[\,q^-_{\alpha',n}\{\hat\mu^*_{\varphi\setminus i}(x) - R^*_i\} - \varepsilon^*,\ q^+_{\alpha',n}\{\hat\mu^*_{\varphi\setminus i}(x) + R^*_i\} + \varepsilon^*\,\big]$

satisfies

  $\mathbb{P}\big[\,Y_{n+1} \in \hat C^{*\varepsilon^*\text{-J+}}_{\alpha',n}(X_{n+1})\,\big] \geq 1 - \alpha' - 4\sqrt{\delta^*}$.

Next, by Assumption S1, for each $i = 1,\dots,n$,

  $\mathbb{P}\big[\,|\hat\mu_{\varphi\setminus i}(X_i) - \hat\mu^*_{\varphi\setminus i}(X_i)| > \varepsilon\,\big] \leq \delta$.   (S9)

Let $\alpha' = \alpha + \sqrt{\delta}$. By the above argument, to prove the theorems, it suffices to show that $\hat C^{2\varepsilon\text{-J+aB}}_{\alpha,n,B}(X_{n+1}) \supseteq \hat C^{*0\text{-J+}}_{\alpha',n}(X_{n+1})$ with probability at least $1 - 2\sqrt{\delta}$ in order to complete the proof of Theorem S1, or $\hat C^{(2\varepsilon+2\varepsilon^*)\text{-J+aB}}_{\alpha,n,B}(X_{n+1}) \supseteq \hat C^{*\varepsilon^*\text{-J+}}_{\alpha',n}(X_{n+1})$ with probability at least $1 - 2\sqrt{\delta}$ in order to complete the proof of Theorem S2. In fact, these two claims are proved identically—we simply need to show that

  $\hat C^{(2\varepsilon+2\varepsilon')\text{-J+aB}}_{\alpha,n,B}(X_{n+1}) \supseteq \hat C^{*\varepsilon'\text{-J+}}_{\alpha',n}(X_{n+1})$ with probability at least $1 - 2\sqrt{\delta}$,   (S10)

with the choice $\varepsilon' = 0$ for Theorem S1, or $\varepsilon' = \varepsilon^*$ for Theorem S2.

To complete the proof, then, we establish the bound (S10). Suppose $\hat C^{(2\varepsilon+2\varepsilon')\text{-J+aB}}_{\alpha,n,B}(X_{n+1}) \not\supseteq \hat C^{*\varepsilon'\text{-J+}}_{\alpha',n}(X_{n+1})$. Then either

  $q^+_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(X_{n+1}) + R_i\} + 2\varepsilon + \varepsilon' < q^+_{\alpha',n}\{\hat\mu^*_{\varphi\setminus i}(X_{n+1}) + R^*_i\}$,

or

  $q^-_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(X_{n+1}) - R_i\} - 2\varepsilon - \varepsilon' > q^-_{\alpha',n}\{\hat\mu^*_{\varphi\setminus i}(X_{n+1}) - R^*_i\}$,

where $R_i = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$. As in the proof of Barber et al. [2019, Theorem 5], this implies that

  $|\hat\mu_{\varphi\setminus i}(X_{n+1}) - \hat\mu^*_{\varphi\setminus i}(X_{n+1})| + |\hat\mu_{\varphi\setminus i}(X_i) - \hat\mu^*_{\varphi\setminus i}(X_i)| > 2\varepsilon$

for at least $\lceil(1-\alpha)(n+1)\rceil - (\lceil(1-\alpha')(n+1)\rceil - 1) \geq \sqrt{\delta}(n+1)$ many indices $i = 1,\dots,n$.
Thus,
\begin{align*}
\mathbb{P}\Big[ \hat{C}^{(2\varepsilon + 2\varepsilon')\text{-J+aB}}_{\alpha, n, B}(X_{n+1}) \not\supseteq \hat{C}^{*\,\varepsilon'\text{-J+}}_{\alpha', n}(X_{n+1}) \Big]
&\le \mathbb{P}\Bigg[ \sum_{i=1}^n \mathbb{1}\Big\{ \big|\hat\mu_{\phi\setminus i}(X_{n+1}) - \hat\mu^*_{\phi\setminus i}(X_{n+1})\big| + \big|\hat\mu_{\phi\setminus i}(X_i) - \hat\mu^*_{\phi\setminus i}(X_i)\big| > 2\varepsilon \Big\} \ge \sqrt{\delta}(n+1) \Bigg] \\
&\le \frac{1}{\sqrt{\delta}(n+1)} \sum_{i=1}^n \mathbb{P}\Big[ \big|\hat\mu_{\phi\setminus i}(X_{n+1}) - \hat\mu^*_{\phi\setminus i}(X_{n+1})\big| + \big|\hat\mu_{\phi\setminus i}(X_i) - \hat\mu^*_{\phi\setminus i}(X_i)\big| > 2\varepsilon \Big] \\
&\le \frac{2n}{\sqrt{\delta}(n+1)}\, \mathbb{P}\Big[ \big|\hat\mu_{\phi\setminus n}(X_{n+1}) - \hat\mu^*_{\phi\setminus n}(X_{n+1})\big| > \varepsilon \Big],
\end{align*}
where the second inequality is Markov's inequality, and the third uses a union bound over the two terms together with exchangeability of the data points. By (S9), applied via exchangeability (in each case a regressor trained on $n - 1$ points is evaluated at an independent point), the last line is at most $\frac{2n\delta}{\sqrt{\delta}(n+1)} \le 2\sqrt{\delta}$. Hence
\[
\mathbb{P}\Big[ \hat{C}^{(2\varepsilon + 2\varepsilon')\text{-J+aB}}_{\alpha, n, B}(X_{n+1}) \not\supseteq \hat{C}^{*\,\varepsilon'\text{-J+}}_{\alpha', n}(X_{n+1}) \Big] \le 2\sqrt{\delta},
\]
implying (S10). This completes the proofs of Theorems S1 and S2.
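To make the resampling quantity driving this argument concrete, the following is a minimal sketch (our own illustration—the function name and example values are not from the paper) of $\theta(n)$ for both sampling regimes, and of the induced Binomial draw of $B'$:

```python
import numpy as np

def theta(n, m, with_replacement=True):
    """Probability that one resampled training set of size m, drawn from
    n + 1 points, excludes any one fixed point (the theta(n) above)."""
    if with_replacement:
        return (1 - 1 / (n + 1)) ** m
    return 1 - m / (n + 1)

# Conditional on the data, the number of ensemble members that survive
# leaving one point out is B' ~ Binomial(B, theta).
rng = np.random.default_rng(0)
B_prime = rng.binomial(100, theta(n=200, m=100))  # B = 100 is illustrative
```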
C Jackknife-minmax-after-bootstrap

As in Barber et al. [2019], we may also consider the jackknife-minmax-after-bootstrap, which constructs the interval
\[
\hat{C}^{\text{J-mm-aB}}_{\alpha, n, B}(x) = \Big[\, \min_i \hat\mu_{\phi\setminus i}(x) - q^+_{\alpha, n}\{R_i\},\; \max_i \hat\mu_{\phi\setminus i}(x) + q^+_{\alpha, n}\{R_i\} \,\Big].
\]
The original jackknife-minmax satisfies a $1 - \alpha$ lower bound on the coverage, and the same modification of the jackknife+ proof is applicable here, ensuring a $1 - \alpha$ lower bound on the coverage of the jackknife-minmax-after-bootstrap, with the same caveat of a random $B$. However, as for the non-ensembled version, the method is too conservative, and is not recommended for practice.
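As a small illustration (our own sketch, not code from the paper), given the leave-one-out ensemble predictions at a test point and the leave-one-out residuals, this interval is assembled as follows:

```python
import numpy as np

def jackknife_minmax_ab(mu_loo_at_x, residuals, alpha=0.1):
    """Jackknife-minmax-after-bootstrap interval at a test point x, given
    mu_loo_at_x[i] = muhat_{phi \ i}(x) and residuals[i] = R_i."""
    n = len(residuals)
    k = int(np.ceil((1 - alpha) * (n + 1)))   # index of q^+_{alpha,n}{R_i}
    r = np.sort(residuals)
    q = np.inf if k > n else r[k - 1]
    return min(mu_loo_at_x) - q, max(mu_loo_at_x) + q
```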
D Supplementary experiments

D.1 Additional details about the experimental setup
We give precise definitions of the ensembles and the jackknife-type constructions considered. Let $\mathcal{R}_{\phi,B}$ denote an ensemble regression method (Algorithm 1) that first generates $B$ bootstrap replicates of a given training data set, calls on a base regression method $\mathcal{R}$ to fit a model to each generated data set, after which the results are aggregated through $\phi$.

For $\mathcal{R}$, we use one of Ridge, RF, or NN (a sketch of these settings in scikit-learn follows the list):
• For Ridge, we set the penalty $\lambda$ proportional to $\|X\|$, where $\|X\|$ is the spectral norm of the training data matrix.
• For RF, we used the RandomForestRegressor method from scikit-learn with 20 trees grown for each random forest using the mean absolute error criterion and the bootstrap option turned off, with default settings otherwise.
• For NN, we used the MLPRegressor method from scikit-learn with the L-BFGS solver and the logistic activation function, with default settings otherwise.
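A minimal sketch of these three base learners follows; the ridge multiplier 0.001 is our own placeholder assumption (not the paper's exact constant), and the criterion spelling assumes scikit-learn ≥ 1.0 (older versions spell it "mae"):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

def make_base_learner(name, X_train):
    """Instantiate one of the three base regression methods described above."""
    if name == "ridge":
        # Penalty proportional to the spectral norm of the training matrix;
        # the constant 0.001 is an illustrative assumption.
        lam = 0.001 * np.linalg.norm(X_train, ord=2)
        return Ridge(alpha=lam)
    if name == "rf":
        # 20 trees, mean-absolute-error split criterion, bootstrap turned off.
        return RandomForestRegressor(
            n_estimators=20, criterion="absolute_error", bootstrap=False)
    if name == "nn":
        return MLPRegressor(solver="lbfgs", activation="logistic")
    raise ValueError(name)
```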
For $\phi$, we use one of Mean, Median, or Trimmed mean:
• Mean is the arithmetic mean, i.e., $\phi(y_1, \dots, y_k) = k^{-1}\sum_{i=1}^k y_i$.
• Median is the middle value of a list, i.e., for odd $k$, $\phi(y_1, \dots, y_k)$ is the $(k+1)/2$-th smallest number of the list $\{y_1, \dots, y_k\}$, and for even $k$, the average of the $k/2$-th and the $(k+2)/2$-th smallest.
• Trimmed mean is the arithmetic mean of the middle 50% of a list, i.e., $\phi(y_1, \dots, y_k) = (\lceil 0.75k \rceil - \lfloor 0.25k \rfloor)^{-1} \sum_{i=\lfloor 0.25k \rfloor + 1}^{\lceil 0.75k \rceil} y_{(i)}$, where $y_{(1)} \le \cdots \le y_{(k)}$ is the sorted list. We used scipy.stats.trim_mean with proportiontocut=0.25 (see the sketch after Algorithm 4).

The J+aB was defined in Algorithm 2. J+ensemble refers to the following application of the jackknife+ [Barber et al., 2019] with the ensemble learner $\mathcal{R}_{\phi,B}$:

Algorithm 4 J+ensemble
for i = 1, …, n do
  Compute $\hat\mu^{\text{J+ensemble}}_{\setminus i} = \mathcal{R}_{\phi,B}\big(\{(X_j, Y_j)\}_{j=1, j \ne i}^n\big)$
  Compute the residual $R^{\text{J+ensemble}}_i = |Y_i - \hat\mu^{\text{J+ensemble}}_{\setminus i}(X_i)|$.
end for
Compute the ensembled prediction interval: at each $x \in \mathbb{R}^p$,
\[
\hat{C}^{\text{J+ensemble}}_{\alpha, n, B}(x) = \Big[\, q^-_{\alpha, n}\{\hat\mu^{\text{J+ensemble}}_{\setminus i}(x) - R^{\text{J+ensemble}}_i\},\; q^+_{\alpha, n}\{\hat\mu^{\text{J+ensemble}}_{\setminus i}(x) + R^{\text{J+ensemble}}_i\} \,\Big].
\]
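For concreteness, here is a minimal sketch of the three aggregation functions (our own illustration; scipy.stats.trim_mean is called exactly as described above):

```python
import numpy as np
from scipy.stats import trim_mean

# The three aggregation functions phi, applied to the k predictions
# {y_1, ..., y_k} collected from the fitted models.
def phi_mean(preds):
    return np.mean(preds)

def phi_median(preds):
    return np.median(preds)

def phi_trimmed_mean(preds):
    # Arithmetic mean of the middle 50%: cut 25% from each tail.
    return trim_mean(preds, proportiontocut=0.25)
```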
J+non-ensemble applies the jackknife+ to the base learning algorithm $\mathcal{R}$ without ensembling:

Algorithm 5 J+non-ensemble
for i = 1, …, n do
  Compute $\hat\mu^{\text{J+non-ensemble}}_{\setminus i} = \mathcal{R}\big(\{(X_j, Y_j)\}_{j=1, j \ne i}^n\big)$
  Compute the residual $R^{\text{J+non-ensemble}}_i = |Y_i - \hat\mu^{\text{J+non-ensemble}}_{\setminus i}(X_i)|$.
end for
Compute the non-ensembled prediction interval: at each $x \in \mathbb{R}^p$,
\[
\hat{C}^{\text{J+non-ensemble}}_{\alpha, n}(x) = \Big[\, q^-_{\alpha, n}\{\hat\mu^{\text{J+non-ensemble}}_{\setminus i}(x) - R^{\text{J+non-ensemble}}_i\},\; q^+_{\alpha, n}\{\hat\mu^{\text{J+non-ensemble}}_{\setminus i}(x) + R^{\text{J+non-ensemble}}_i\} \,\Big].
\]
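Algorithms 4 and 5 share the same leave-one-out loop and quantile construction. Below is a self-contained sketch of ours (with $q^+_{\alpha,n}$ the $\lceil(1-\alpha)(n+1)\rceil$-th smallest value and $q^-_{\alpha,n}$ the $\lfloor\alpha(n+1)\rfloor$-th smallest, following Barber et al. [2019]); the `fit` argument stands in for either $\mathcal{R}_{\phi,B}$ or the plain base method $\mathcal{R}$:

```python
import numpy as np

def q_plus(values, alpha):
    """q^+_{alpha,n}: the ceil((1 - alpha)(n + 1))-th smallest of n values."""
    v = np.sort(values)
    k = int(np.ceil((1 - alpha) * (len(v) + 1)))
    return np.inf if k > len(v) else v[k - 1]

def q_minus(values, alpha):
    """q^-_{alpha,n}: the floor(alpha(n + 1))-th smallest of n values."""
    v = np.sort(values)
    k = int(np.floor(alpha * (len(v) + 1)))
    return -np.inf if k < 1 else v[k - 1]

def jackknife_plus_interval(fit, X, Y, x_test, alpha=0.1):
    """Leave-one-out loop shared by Algorithms 4 and 5. `fit(X, Y)` must
    return a fitted model with a .predict method (e.g., any scikit-learn
    regressor, or an ensemble wrapper playing the role of R_{phi,B})."""
    n = len(Y)
    mu_loo, resid = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], Y[keep])
        mu_loo[i] = model.predict(x_test.reshape(1, -1))[0]
        resid[i] = abs(Y[i] - model.predict(X[i].reshape(1, -1))[0])
    return q_minus(mu_loo - resid, alpha), q_plus(mu_loo + resid, alpha)
```

For example, `jackknife_plus_interval(lambda X, y: Ridge(alpha=1.0).fit(X, y), X, Y, x)` would reproduce Algorithm 5 with a ridge base learner.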
D.2 Other aggregation methods

In Section 5, we reported the results for $\phi$ = Mean. Here, we report the results for $\phi$ = Median or Trimmed mean. For the data sets and the base regression methods we looked at, Median and Trimmed mean did not behave much differently from Mean. Thus, we continue to see similar patterns: Figures S1 and S3 look very much like Figure 1, and Figures S2 and S4, like Figure 2.
D.3 Effect of fixing B for stable ensembles

In Appendix B, we saw that for stable ensembles, concentration with respect to the resampling measure implies that the J+aB using a fixed value of $B$ will retain some coverage guarantee as long as enough models are being aggregated. As an example of stable ensembles, we gave bagging. Here, we provide numerical support for this conclusion by running the J+aB either with $B$ fixed at a value (J+aB fixed) or with $B$ drawn at random (J+aB random).
• For J+aB fixed, we used $B = 50$.
• For J+aB random, we drew $B \sim \mathrm{Binomial}\big(\tilde{B}, (1 - \tfrac{1}{n+1})^m\big)$ with $\tilde{B} = [50 / (1 - \tfrac{1}{n+1})^m]$, where $[\,\cdot\,]$ refers to the integer part of the argument. This ensures that the total number of models being fitted in J+aB random is matched on average to the total in J+aB fixed (a sketch of this draw is given at the end of this subsection).
We fixed $\alpha = 0.1$ for the target coverage of 90%. We used $n = 200$ observations for training, sampling uniformly without replacement to create a training-test split for each trial. The results presented here are from 10 independent training-test splits of each data set. We otherwise repeat the setup of Section 5, which includes the three data sets, the three choices for the base regression method, and the three choices of aggregation. The results are summarized in Figures S5–S10. They show that, for the data sets and the ensemble methods considered, J+aB fixed and J+aB random behave essentially the same. Although we only prove ensemble stability for $\phi$ = Mean, because both Median and Trimmed mean act like Mean, at least for the data sets and the base regression methods we have looked at, the same patterns are replicated for the two alternative aggregation methods.
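A minimal sketch of the matched-budget draw for J+aB random (our own illustration; the function name and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_random_B(n, m, budget=50, with_replacement=True):
    """Draw B ~ Binomial(B_tilde, theta) with B_tilde = [budget / theta],
    so that E[B] approximately matches the fixed budget of J+aB fixed."""
    theta = (1 - 1 / (n + 1)) ** m if with_replacement else 1 - m / (n + 1)
    B_tilde = int(budget / theta)   # [.] = integer part
    return rng.binomial(B_tilde, theta)

# e.g., n = 200 training points and m = 100: B fluctuates around 50
B = draw_random_B(200, 100)
```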
D.4 Wall clock time comparisons

In Tables S1–S3, we report the average wall-clock times for all combinations of data set, base regression method, and aggregation method, with $m$ held at a fixed fraction of $n$. As these measurements are expected to vary depending on the hardware and implementation details, it is the relative magnitudes that are of interest. Our experiments were run on a standard MacBook Air 2018 laptop. The results lend extra support to the conclusion that the J+aB is a computationally efficient alternative to J+ensemble, which yields more precise confidence intervals than
J+non-ensemble when ensembling improves the precision of the base regression method.

Table S1: Average wall-clock times in seconds over 10 independent splits of MEPS ($m$ a fixed fraction of $n$, sampling with replacement).

  R      ϕ              J+aB    J+ensemble    J+non-ensemble
  Ridge  Mean            0.2           2.1               0.4
         Median          0.5           2.8
         Trimmed mean    0.5           2.7
  RF     Mean            3.0          61.6               4.9
         Median          3.9          63.1
         Trimmed mean    3.9          56.4
  NN     Mean            8.8         257.7              14.4
         Median         10.2         213.8
         Trimmed mean   10.0         206.9
Table S2: Average wall-clock times in seconds over 10 independent splits of Blog ($m$ a fixed fraction of $n$, sampling with replacement).

  R      ϕ              J+aB    J+ensemble    J+non-ensemble
  Ridge  Mean            0.5           6.7               1.5
         Median          1.1           9.1
         Trimmed mean    1.2           9.0
  RF     Mean            8.7         191.3              11.1
         Median          9.6         197.3
         Trimmed mean    9.7         197.1
  NN     Mean           36.8         835.8              46.4
         Median         39.4         891.3
         Trimmed mean   37.7         843.7

Figure S1: Distributions of coverage (averaged over each test data) in 10 independent splits using $\phi$ = Median. The black line indicates the target coverage of $1 - \alpha$. [3×3 panel grid: rows MEPS, Blog, Communities; columns Ridge, RF, NN; horizontal axis m/n; methods J+ non-ensemble, J+ ensemble, J+aB.]

Figure S2: Distributions of interval width (averaged over each test data) in 10 independent splits using $\phi$ = Median. [Same panel layout as Figure S1.]

Figure S3: Distributions of coverage (averaged over each test data) in 10 independent splits using $\phi$ = Trimmed mean. The black line indicates the target coverage of $1 - \alpha$. [Same panel layout as Figure S1.]

Figure S4: Distributions of interval width (averaged over each test data) in 10 independent splits using $\phi$ = Trimmed mean. [Same panel layout as Figure S1.]

Figure S5: Distributions of coverage of J+aB random and J+aB fixed (averaged over each test data) in 10 independent splits using $\phi$ = Mean. The black line indicates the target coverage of $1 - \alpha$. [Same panel layout as Figure S1, with methods J+aB random B and J+aB fixed B.]

Figure S6: Distributions of interval width of J+aB random and J+aB fixed (averaged over each test data) in 10 independent splits using $\phi$ = Mean. [Same panel layout as Figure S5.]

Figure S7: Distributions of coverage of J+aB random and J+aB fixed (averaged over each test data) in 10 independent splits using $\phi$ = Median. The black line indicates the target coverage of $1 - \alpha$. [Same panel layout as Figure S5.]

Figure S8: Distributions of interval width of J+aB random and J+aB fixed (averaged over each test data) in 10 independent splits using $\phi$ = Median. [Same panel layout as Figure S5.]

Figure S9: Distributions of coverage of J+aB random and J+aB fixed (averaged over each test data) in 10 independent splits using $\phi$ = Trimmed mean. The black line indicates the target coverage of $1 - \alpha$. [Same panel layout as Figure S5.]

Figure S10: Distributions of interval width of J+aB random and J+aB fixed (averaged over the test data) in 10 independent splits using $\phi$ = Trimmed mean. [Same panel layout as Figure S5.]

Table S3: Average wall-clock times in seconds over 10 independent splits of Communities ($m$ a fixed fraction of $n$, sampling with replacement).