Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning
Lucas Mentch ∗ and Giles Hooker † February 25, 2021
Abstract
In 2001, Leo Breiman wrote of a divide between “data modeling” and “algorithmic modeling” cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the “data modelers” incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman’s own Random Forest methods. While this can be simplistically described as “Breiman won”, these same developments also expose the limitations of the prediction-first philosophy that he espoused, making careful statistical analysis all the more important. This paper outlines these exciting recent developments in the random forest literature which, in our view, occurred as a result of a necessary blending of the two ways of thinking Breiman originally described. We also ask what areas statistics and statisticians might currently overlook.
Twenty years after its initial publication, Breiman’s “Two Cultures” essay (Breiman et al., 2001) remains an entertaining and enlightening read, though we are struck by how uncontroversial new readers in the modern era may likely find many of its prescriptions. We begin our commentary by comparing Breiman’s views at the time with current perspectives on data science and argue that while the field of statistics may have fallen short on some fronts, Breiman would no doubt be pleased with a number of strides made in recent decades. In our view, these developments have been made not because of a switch in statistical thinking from data to algorithmic modeling, but because rigorous statistical analysis has allowed these methods to be understood within more traditional statistical frameworks. We detail how the willingness to expand the statistical toolbox without sacrificing fundamental principles has led to a rapid development of important recent studies on random forests and urge the statistical community to continue these efforts to better understand cutting-edge methodologies in deep learning. Throughout the remainder of this paper, quotes and references including only a page number are in reference to Breiman’s original essay (Breiman et al., 2001).

“If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.” – Page 199

∗ University of Pittsburgh, [email protected]  † Cornell University, [email protected]

…be data science, no one does more of the job (Yu, 2014). Top statistics departments in the United States including, for example, Carnegie Mellon, Yale, and Cornell University, now include ‘Data Science’ in the official department name. Thus, while the field of statistics has perhaps not expanded quickly enough in the minds of some, we think that Breiman would be very pleased with the willingness to embrace new directions exemplified in recent decades.
“There is an old saying: ‘If all a man has is a hammer, then every problem looks like a nail.’ The trouble for statisticians is that recently some of the problems have stopped looking like nails.” – Page 204

The foundation of Breiman’s arguments lies in the distinction between what he refers to as “data modeling” as opposed to “algorithmic modeling”. Data modeling, according to Breiman, is the traditional statistical modeling context wherein a basic framework like Y = f(X) + ε is assumed and simple parametric models like linear regression, logistic regression, or the Cox model are utilized to estimate f. The algorithmic modeling culture, by contrast, utilizes tools like decision trees and neural networks and “considers the inside of the box complex and unknown. Their approach is to find a function f(x) – an algorithm that operates on x to predict the responses y” (p. 199). “The commitment to data modeling,” Breiman argues, “has prevented statisticians from entering new scientific and commercial fields where the data being gathered is not suitable for analysis by data models” (p. 200). “
The best available solution to a data problem might be a data model; then again it might be an algorithmic model” (p. 204).

But while the distinction between data and algorithmic models may have seemed natural at the time, 20 years later it feels arbitrary and unnecessary: Y = f(X) + ε is a data model! Throughout the essay, Breiman seems to repeatedly conflate “data models” and “parametric models”. It’s not clear why Breiman seemed to feel that tools like decision trees cannot or should not be implemented in conjunction with at least some minor model assumptions, and his reference to Grace Wahba’s foundational, and very statistical, work on nonparametric smoothing as outside of mainstream statistics, at least in 2001, seems strange.

Breiman himself goes so far as to admit that one of the primary goals in statistics is to obtain information about the underlying data generating mechanism (pp. 203, 214). And insofar as this goal is concerned, it’s important to stress that neither the data nor the algorithmic modeling approach is of any value when carried out at the extreme. A data model, though perhaps interpretable, holds no value if the model itself is not reflective of the process that generated the original data. On the other hand, an algorithmic model that amounts to an impenetrable black box, while it may produce accurate predictions, is merely a potentially more efficient version of nature and therefore of limited scientific value – we haven’t learned anything about the mechanisms and variables driving those predictions.

In our view, the enormous progress that has been made by statisticians in recent decades came not from abandoning the traditional principles of data modeling, but by merging the cultures Breiman describes.
The willingness on the part of statisticians to expand the traditional modeling toolbox without sacrificing core data analytic and uncertainty quantification principles has led to tremendous gains in our knowledge and understanding of modern supervised learning techniques. Random forests – the apple of Breiman’s eye – is certainly one of the methodologies that has benefitted most from this merged mindset.

“So forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. ... Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why.” – Page 208

The second sentence in the above quote is perhaps the best summary of the latter half of Breiman’s essay and arguably the fundamental principle that has stimulated the recent progress in understanding the success and inner workings of black-box learning algorithms like random forests. At the time, much of the traditional statistical modeling culture revolved around formulating parametric models and making continual tweaks until the model seemed to fit best. The process begins with an explicit and interpretable model, and modifications are made to try and improve accuracy. Breiman – stemming from his belief that “...the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable” (p. 208) – urged statisticians to consider the process in reverse: begin with an accurate but potentially obscure model and perform follow-up investigation to uncover the impact of individual predictors on the response. Breiman kickstarts these efforts in his work with random forests and out-of-bag (oob) variable importance, demonstrating that “a model does not have to be simple to provide reliable information about the relation between predictor and response variables” (pp. 209–211).
Interestingly, in doing so, he himself implicitly begins to utilize traditional data modeling thinking in order to extract insight from complex models. In the following sections, we detail how similar thinking has recently led to a rapid advancement in our understanding of random forests as well as the ways in which they can be reliably used to further our understanding of the underlying processes.

“My kingdom for some good theory.” – Page 205

Breiman’s preference for algorithmic models did not lead him to abandon statistical theory; indeed his essay explicitly recognizes the theoretical developments underpinning both smoothing splines and support vector machines (although it unaccountably ignores a large statistical literature on other nonparametric smoothing methods) and he sketched such an analysis of random forests. The two decades since then have seen statistics slowly fill in theoretical details, particularly about the random forest method. It’s unclear whether it was Breiman’s essay or the methods’ reliable “off-the-shelf” performance (or a combination of the two) that motivated a statistical focus on random forests in particular, but our own interest in these methods was also influenced by recognizing an appealing macroscopic structure that could be used to develop mathematical theory.

The pace of these developments at times felt glacial, due somewhat to the difficulty of developing an analysis that accounts for the greedy splitting strategies involved in building trees. Theoretical headway has come from making a number of modifications, some arguably beneficial, others purely focussed on mathematical tractability:

1. Methods that provide guarantees on the (geometric) size of the leaves of individual trees, usually obtained by introducing randomly chosen splits. Originally, this involved generating tree structures largely at random (Biau et al., 2008; Biau and Devroye, 2010; Biau, 2012).
More recently, the regularity conditions of Wager and Athey (2018) can be enforced by choosing a split at random with some probability while otherwise allowing data-informed splitting methods to be used. Controlling leaf dimensions provides a similar control over within-leaf bias, where the tree represents the underlying relationship as constant, and hence forms an important component of consistency results.

2. Methods that enforce independence between tree structures and leaf values. We think of this as avoiding the problem of trees “chasing order statistics” or finding structures that isolate exceptionally (and randomly) large values of the response. Choosing tree structures completely at random and letting the data dictate leaf values naturally provides this and facilitates a representation of random forests as kernel methods (Scornet, 2016). Wager and Athey (2018) propose splitting the data between samples used to determine tree structure and those assigning values to leaves. The theoretical advantage of this separation is that it allows the values given to leaves to be unbiased for the average response in the leaf. This completes a control of bias when combined with controlling the size of the leaves. The authors also argue for performance improvements in addition to theoretical guarantees based on reducing edge bias. However, this improvement is counteracted by reducing the size of the sample available for building trees (and hence their depth) and we thus regard that trade-off as unclear. The standard CART split criterion (Breiman et al., 1984) seems to remain the overwhelmingly most popular choice in practice.

3. Changing the sampling structure. Breiman’s original random forest algorithm obtained trees from bootstrap samples of the data. The last five years have seen a switch to employing subsamples instead.
This is motivated by the connection between random forests based on subsamples and classical U-statistic constructions and allowed both Mentch and Hooker (2016) and Wager and Athey (2018) to develop central limit theorems for the predictions that random forests give. These have been refined by analyses in Peng et al. (2019), who also provided Berry-Esseen results, and Zhou et al. (2019), who extended the analysis to subsampling with replacement and thereby obtained improved variance estimates. Scornet et al. (2015) also utilize subsampling to obtain what is arguably the “best” consistency result to date, in that it covers estimators very closely resembling the form of random forests originally proposed by Breiman (2001).

The switch to subsampling can be regarded as assisting Breiman’s objective of reducing correlation between trees – we’d like subsamples large enough that trees can be built to reasonable depths (to improve tree-level accuracy) but small enough that between-tree correlation is mitigated. While this creates another tuning parameter, Zhou et al. (2019) found that using subsamples with replacement also reduced the sensitivity of results to this fraction, and that the optimal subsample size is not always that of the training data.

We note that, similar to more classical nonparametric smoothing results, these central limit theorems are not necessarily centered at a target f(X) = E(Y | X). The U-statistic argument applies quite generally to averages of any sort of learner, but specific analysis is required to make E[f̂(X)] − f(X) small enough to be ignorable for inference (e.g. Wager and Athey, 2018). In nonparametric smoothing, removing the bias from a central limit theorem generally involves under-smoothing relative to the optimal predictive model (Eubank, 1999).
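To make the subsampled-ensemble structure underlying these U-statistic arguments concrete, the following is a minimal sketch. Purely for illustration, the base learners are least-squares line fits rather than trees, and the function names (`fit_base`, `subsampled_ensemble`) and toy data are ours, not from the literature; the point is only the "average of base learners fit on random subsamples" construction.

```python
import random

def fit_base(xs, ys):
    """Least-squares slope/intercept fit on one subsample
    (a stand-in for a tree base learner)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def subsampled_ensemble(xs, ys, B=200, k=20, seed=0):
    """Average B base learners, each fit on a random subsample of size k
    drawn without replacement -- the incomplete U-statistic structure."""
    rng = random.Random(seed)
    n = len(xs)
    learners = []
    for _ in range(B):
        idx = rng.sample(range(n), k)
        learners.append(fit_base([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in learners) / B

# toy data: y = 2x + noise
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(200)]
ys = [2 * x + rng.gauss(0, 0.1) for x in xs]
fhat = subsampled_ensemble(xs, ys)
print(round(fhat(0.5), 2))  # close to the true value 2 * 0.5 = 1.0
```

The subsample size k plays exactly the tuning role discussed above: larger k gives more accurate base learners, smaller k gives less correlated ones.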
While we readily acknowledge the value in being able to perform strictly formal inference (see below for examples), we also believe that an analysis of stability has value in itself. These developments allow us to provide results that state, for a given point of interest x,

ζ_n(x)^{-1} ( f̂_n(x) − η_n(x) ) ∼ N(0, 1)    (1)

where particular expressions for the mean and standard deviation functions η_n(x) and ζ_n(x) depend on particulars of the modification of the random forest algorithm. However, the general framework creates the basis for the formalized inference we describe below.

None of the above modifications are strictly necessary in the sense that (1) is provably false without them. As noted above, Scornet et al. (2015) have produced a consistency result for estimators nearly identical to the original random forest algorithm. Such minor modifications have largely been motivated as a means to mathematical tractability, although in some cases they may also provide genuine improvements in predictive or inferential performance. These ideas have also been accompanied by a wide range of modifications to random forests that repurpose them for estimating individual treatment effects (Wager and Athey, 2018), survival analysis (Gordon and Olshen, 1985; Segal, 1988; Davis and Anderson, 1989; Hothorn et al., 2004; Zhu and Kosorok, 2012; Steingrimsson et al., 2016), quantile regression (Meinshausen, 2006) and multivariate outcomes (Segal and Xiao, 2011), with similar analyses to those we outline above. Breiman’s ideas about using co-in-leaf status as a measurement of proximity between points have also been taken up to provide local models (Athey et al., 2019), prototypes (Tan et al., 2020) and prediction intervals (Lu and Hardin, 2019); see also Zhang et al. (2019) for prediction intervals based on conformal methods.

These results have all been focussed around random forest-like methods, but they are hardly the only machine learning methods available. Though not shy about promoting his own work, Breiman’s essay specifically discusses boosting (Freund et al., 1996; Freund and Schapire, 1997; Friedman, 2001) as a rival methodology with real potential. These methods received early attention from analogies to the LASSO in linear models (Friedman, 2001) and initial analysis in Bühlmann (2002).
Boosting is able to accommodate a wide array of model forms and restrictions (Breiman’s comments suggest he might not approve of the latter), some of which are explored in Hothorn et al. (2010); Lou et al. (2013); Zhou and Hooker (2018). Nonetheless, theoretical analysis remains rudimentary and a long way from (1). Early analyses in this direction in Zhou and Hooker (2018) try to make boosting look more like random forests and may provide a starting point; Bayesian methods (Chipman et al., 2010; Ročková and Saha, 2019; Hill, 2011) have also seen significant development.

“Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is. The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model.” – Page 209

While statistical understanding of the properties of random forests has focused on analyzing models that are increasingly similar to Breiman’s original ideas, this has not been the case for the tools that he advocates for deriving knowledge from them. He acknowledges that the tools he advocates result in algebraically-complex models that do not allow human inspection, but proposes that we should first find a model that predicts well and then reverse-engineer what it does. This proposal has been directly attacked by some (e.g. Rudin, 2018), but the specifics of Breiman’s approaches have also been shown to be potentially misleading. In fact, identifying, or testing, the important covariates in a random forest is a challenging procedure with many popular methods exhibiting distinct flaws.
Here we review some of these and point to ways in which variable importance can be validly assessed, and point to how improved statistical understandings of random forests have contributed to these recommendations.

Breiman’s original proposal for eliciting information about the relation between predictor and response variables was via his out-of-bag variable importance measures. These proceed on a tree-by-tree basis:

1. Choose those data points that were not included in the sample on which the tree was constructed. Call this data set Z.

2. Form a new data set Z^π by permuting the values of the variable of interest (X_1 throughout the discussion below, with X = (X_1, X_{-1})).

3. Compare the predictive accuracy of using the current tree to predict the response in Z to that when using Z^π.

The idea of permuting features can be applied generally, but specifically permuting out-of-bag data does, by definition, require an ensemble structure constructed via bootstrapping or at least some form of resampling. It also means we measure the accuracy of individual members of the ensemble, not the ensemble as a whole. This general approach was critiqued by numerous studies, including both Strobl et al. (2008) and Hooker (2007), for also breaking any relationships between X_1 and X_{-1}. In particular it may lead to values in Z^π that are far from any observed data points, thereby evaluating the ways in which a random forest extrapolates at least as much as the signal that it picks up. Despite these early warnings, the method proved (and unfortunately remains) popular as a variable screening tool; a stereotypical example can be found in Díaz-Uriarte and De Andres (2006) where features measured as least important are successively removed. This prompted us to repeat and formalize the critique with further studies in Hooker and Mentch (2019).
In those studies, as in many of the earlier works, we found that pairs of correlated features were likely to have their importance estimates inflated, despite having an impact that would conventionally be regarded as smaller. We stand by our recommendation not to use these methods.

The other common measure of variable importance among random forest implementations is specific to tree structures and measures the improvement in accuracy that is obtained each time the tree is split (e.g. Friedman, 2001); a variable accumulates the improvements associated with those splits that use it. This has been critiqued for favoring variables with more potential split points in Strobl et al. (2007), and we admit to having been surprised by the size of the effect found in simulation in Zhou and Hooker (2019). This paper, along with others (Li et al., 2019; Loecher, 2020), has attempted to correct for this bias by using out-of-sample data, but the precise interpretation of this measure is also not clear outside of the very specific cases in Scornet (2020).

Some of the flaws in the above methods are surprising while others, in retrospect, appear foreseeable. So are there ways to more appropriately and rigorously assess variable importance? Yes, though none are as computationally or algorithmically simple as those above.

The primary problem with the out-of-bag approaches outlined above is that models are constructed only on the original data – inspections are made by altering (permuting) test data, but never by altering the training data and reconstructing the model. We refer to this general idea as “permute-and-repredict” in Hooker and Mentch (2019).
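The permute-and-repredict logic – and the fact that it never refits the model – can be made concrete with a small, model-agnostic sketch. Here a simple least-squares fit stands in for the fitted forest, and the evaluation data stand in for per-tree out-of-bag samples; all names and the toy data are illustrative assumptions of ours.

```python
import random

rng = random.Random(0)
n = 300
x1 = [rng.uniform(0, 1) for _ in range(n)]
x2 = [rng.uniform(0, 1) for _ in range(n)]          # pure noise feature
y  = [3 * a + rng.gauss(0, 0.2) for a in x1]

def marginal_slope(xs, ys):
    """Least-squares slope of ys on xs, plus the two means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den, mx, my

# "Fitted model": marginal least-squares slopes (features are independent here),
# standing in for any black-box predictor such as a tree ensemble.
(b1, m1, _), (b2, m2, my) = marginal_slope(x1, y), marginal_slope(x2, y)
predict = lambda a, b: my + b1 * (a - m1) + b2 * (b - m2)

def mse(xs1, xs2):
    return sum((predict(a, b) - t) ** 2 for a, b, t in zip(xs1, xs2, y)) / n

def permute_importance(col, seed=1):
    """Permute one feature in the evaluation data and re-predict
    with the SAME fitted model -- no refitting anywhere."""
    perm = col[:]
    random.Random(seed).shuffle(perm)
    base = mse(x1, x2)
    if col is x1:
        return mse(perm, x2) - base
    return mse(x1, perm) - base

imp_signal, imp_noise = permute_importance(x1), permute_importance(x2)
print(imp_signal > imp_noise)  # the signal feature loses far more accuracy
```

With independent features this behaves as intended; the critique above is precisely that when features are correlated, permuting one of them produces evaluation points far from the data, so the measure partly reflects how the model extrapolates.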
In most practical applications, however, when one asks whether a particular feature – say X_1 – is “important”, they are really asking whether the model estimate (learning procedure, algorithm) could be just as accurate without it. Investigating this question directly thus necessarily involves the more computationally intensive idea of rebuilding the model under alterations to the training data.

In our earlier work on random forests (Mentch and Hooker, 2016), we proposed a hypothesis test for variable importance based on exactly this idea: we construct one random forest on the original data, then another where X_1 is removed from the dataset, and compare the difference in predictions between the two models. Noticing that predictions are sometimes affected by the dimension of the feature space, we advocated that the test be carried out by permuting X_1 in the data used by the second random forest rather than removing it from the model entirely. In more recent work, Coleman et al. (2019) proposed a more computationally efficient version of this test that measures the difference in accuracy between the two forests rather than the difference in raw predictions. This kind of idea has also appeared in the conformal inference literature, where Lei et al. (2018) proposed LOCO (Leave-One-Covariate-Out) tests that follow exactly the same intuition, and in a model-agnostic test recently proposed in Williamson et al. (2020).

While the “drop-and-rebuild” and “permute-and-rebuild” approaches both improve upon the “permute-and-repredict” framework, neither is perfect in practical settings because in addition to breaking the relationship between the feature and response, we also break the relationship between the feature and all remaining features. A natural further fix is thus to replace X_1 with a copy generated from its distribution conditional on X_{-1}. This was, to our knowledge, first advocated in Strobl et al.
(2008), a critique that also motivated the GUIDE procedure (Loh, 2002), and holds some resemblance to the new literature on knockoffs (Barber et al., 2015), conditional randomization tests (Candès et al., 2018; Liu and Janson, 2020), and holdout randomization tests (Tansey et al., 2018). However, while such an approach is certainly theoretically appealing, the task of generating conditional simulations of X_1 | X_{-1} is non-trivial; a number of potential methods exist to do this, but more work needs to be done here. Note, too, that such conditional approaches measure the information in X_1 beyond that provided by the other covariates. We could, instead, simulate X_{-1} | X_1, or some subset of the covariates. To this end, Shapley values (Lundberg and Lee, 2017) have been suggested as representing the marginal improvement in predictive accuracy given by X_1, averaged over the order of inclusion. Alternative functional ANOVA methods (Hooker, 2007) relate variable importances to statistically-familiar types of sums of squares, allow the examination of interactions, and can account for covariate distributions, albeit at a potentially prohibitive computational cost.

“an algorithmic model can produce more and more reliable information about the structure of the relationship between inputs and outputs than data models.” – Page 200

While statistical theory and insight have helped to understand and improve our methods of extracting information from a random forest, we can go further and start to use random forests explicitly for statistical inference. The CLT in (1) can be used to provide confidence intervals about individual predictions (Mentch and Hooker, 2016) or other quantities averaged over some covariate distribution (Wager and Athey, 2018). However, ideas such as variable importance suggest hypotheses about structure along the lines of

H_0 : η(X_1, X_{-1}) = η*(X_{-1})    (2)

for some function η*. Here we can use the extension of (1) to multivariate or process distributions over the input space.
In particular, Mentch and Hooker (2016) provide formal tests of this form by evaluating the differences

D(X) = (1/B) Σ_b ( T_b^ω(X) − T_b^π(X) )    (3)

between trees T^ω built with the original data, and trees T^π built using a permuted version of X_1. Since the D(X) have a joint multivariate central limit theorem, we can form a χ²-test of their values at a set of evaluation points.

Alternatively, the functional ANOVA methods in Hooker (2007) were turned into formal tests in Mentch and Hooker (2017). Specifically, we interpret the hypothesis in (2) as

H_0 : η(X_{i,1}, X_{k,-1}) = η(X_{j,1}, X_{k,-1})

over a set of covariate values (X_{i,1}, X_{k,-1}). This creates a grid of function evaluations for which H_0 is a linear contrast that can again be tested via a multivariate χ² statistic. The same techniques can be extended to examine interactions by testing hypotheses of the form

H'_0 : η(X_{i,1}, X_{j,2}, X_{k,-(1,2)}) = η(X_{i,1}, X_{k,-(1,2)}) + η(X_{j,2}, X_{k,-(1,2)}).

Any of these testing frameworks require evaluation at a large number of covariate values, making variance estimates unstable, and Mentch and Hooker (2017) resorted to random projection methods (Srivastava et al., 2016) to stabilize them. They also rely heavily on asymptotic normality results and on a fixed set of evaluation points. Coleman et al. (2019) considered using the two forests that contribute to (3), but developed a permutation test based on randomly swapping trees between forests and measuring the change in differences in error, meaning that far fewer trees need to be constructed, thus allowing for very large test sets.

We hope to show by these methodological developments that Breiman’s notions of “algorithmic modeling” can be accommodated and used readily within classical statistical paradigms, and refer to Chipman et al. (2010) for parallel Bayesian developments.
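The comparison behind the statistic in (3) can be sketched as follows. To keep the sketch self-contained we again use subsampled least-squares fits as stand-ins for the trees T_b, and we omit the formal χ² calibration; all function names and the toy data are ours, not from the cited papers.

```python
import random

rng = random.Random(0)
n = 400
x1 = [rng.uniform(0, 1) for _ in range(n)]
x2 = [rng.uniform(0, 1) for _ in range(n)]
y  = [3 * a + rng.gauss(0, 0.2) for a in x1]

def slope(xs, ys):
    """Least-squares slope of ys on xs, plus the two means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den, mx, my

def ensemble(f1, f2, ys, B=100, k=40, seed=2):
    """Average of B subsampled marginal least-squares fits
    (stand-ins for the trees T_b)."""
    r = random.Random(seed)
    fits = []
    for _ in range(B):
        idx = r.sample(range(len(ys)), k)
        s1, m1, my = slope([f1[i] for i in idx], [ys[i] for i in idx])
        s2, m2, _  = slope([f2[i] for i in idx], [ys[i] for i in idx])
        fits.append((s1, m1, s2, m2, my))
    def predict(a, b):
        return sum(my + s1 * (a - m1) + s2 * (b - m2)
                   for s1, m1, s2, m2, my in fits) / B
    return predict

# ensemble on the original data vs. ensemble with X_1 permuted
x1_perm = x1[:]
random.Random(3).shuffle(x1_perm)
f_orig = ensemble(x1, x2, y)
f_perm = ensemble(x1_perm, x2, y)

# D(x): difference in predictions at a fixed grid of evaluation points
D = [f_orig(a, 0.5) - f_perm(a, 0.5) for a in (0.1, 0.3, 0.5, 0.7, 0.9)]
print(D[0] < -0.5 < 0.5 < D[-1])  # X_1 signal shows up in the differences
```

The formal test evaluates exactly this vector of differences at fixed points and compares it against its joint (asymptotically normal) null distribution.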
Some of the methods we have described lean heavily on at least the ensemble structure of random forests, but others are more general provided a result of the form of (1) is available. When “algorithmic models” are understood as non-parametric regression methods, the chasm that Breiman perceived within statistics can be crossed with only a few well-placed stepping stones.

“But when a model is fit to data to draw quantitative conclusions... the conclusions are about the model’s mechanism, and not about nature’s mechanism.” – Page 202

Despite the enormous progress made on the statistical front in recent years, the question that has perhaps proved the most elusive is also the simplest: why do random forests seem to work so well in practice? In their recent review paper, Biau and Scornet (2016) note that “present results are insufficient to explain in full generality the remarkable behavior of random forests.” And indeed, studies have long demonstrated the surprisingly robust accuracy of the random forest method. The most recent and impressive large-scale empirical comparison of methods was provided in Fernández-Delgado et al. (2014), where the authors compare a total of 179 classifiers across 121 datasets – the entirety of the UCI database (Dua and Graff, 2017) at the time. They found that not only did random forests perform the best overall, but 3 of the top 5 classifiers were some variant of random forests, leading them to declare that “the random forest is clearly the best family of classifiers.”

Until very recently, however, few if any explanations for random forest success had been offered that pushed beyond Breiman’s original intuition of reducing the between-tree correlation. Wyner et al.
(2017) focused on classification settings and hypothesized that their success was due to interpolation – a kind of purposeful overfitting that, in their view, was beneficial because perhaps real-world data are not as noisy as statisticians generally believe.

Mentch and Zhou (2020b) were critical of this explanation, noting in particular that in regression settings, the opposite appears to be true: random forests appear to be most useful in settings where the signal-to-noise ratio (SNR) is low. The authors proposed an alternative explanation based on degrees of freedom wherein the additional node-level randomness in random forests implicitly regularizes the procedure, making it particularly attractive in low SNR settings. Based on this idea, they demonstrate empirically that the advantage of random forests over bagging is most pronounced at low SNRs and disappears at high SNRs; when the trees in random forests are replaced with linear models, these kinds of implicit regularization claims can be proved formally. This finding deals a serious blow to the widely held belief shared by both Breiman (2001) and Wyner et al. (2017) that random forests simply “are better” than bagging as a general rule. Perhaps most surprisingly, the authors also demonstrate that when a linear model selection procedure like forward selection incorporates a similar kind of random feature availability at each step, the resulting models can be substantially more accurate and even routinely outperform classical regularization methods like the lasso.

In follow-up work, Mentch and Zhou (2020a) build on this implicit regularization to introduce another surprising result: the predictive accuracy of both bagging and random forests can sometimes be dramatically and systematically improved (particularly at low SNRs) by including additional noise features in the model which, by construction, hold no additional information about the response beyond that contained in the original features.
The authors refer to these kinds of procedures as augmented bagging and show that even the formal hypothesis testing procedures above will, correctly but counterintuitively, recognize large groups of noise features as making a significant contribution to the predictive accuracy of the model.

This last point is quite surprising and easy to misinterpret; readers seeing this for the first time might reasonably ask whether these kinds of hypothesis tests should really be trusted if they routinely identify noise features as predictively significant. We stress, however, that these tests are merely evaluating whether there is a significant improvement in accuracy when additional noise features are added to the model. This does indeed appear to be the case in some settings – adding noise features really can make the models more accurate. Thus, in those settings, it’s important to realize that in identifying such noise features as predictively helpful, these tests are not making an error. Rather, as discussed at length in Mentch and Zhou (2020a), it is not the tests themselves that are wrong but the common (mis)interpretations of those tests.

However, this observation does confound two separate questions: whether the feature improved the predictive accuracy of a model, and whether it is associated with the response by real-world processes. Alternatively: is the feature uniquely helpful – do we need to include the same measurements, or would the current value of the microwave background radiation be just as useful? “What is your model for the data?” – the question Breiman objected to – in fact becomes essential: are we measuring a property of the world that generated the data, or of the algorithm used to analyze it? Both are reasonable to measure, but they are not the same, and cannot be assessed by the same methods. Crucially, Breiman’s standard of predictive accuracy, while applying to properties of the algorithm, is not necessarily indicative of real-world processes.
We expect that variable importance will remain an important topic of study in future years as data become larger and more complex and black-box models are increasingly relied upon as a result.

“It may be revealing to understand how I became a member of the small second culture... My experiences as a consultant formed my views about algorithmic modeling.”
Tachycineta bicolor) fall migration. Specifically, daily maximum temperature was expected to drive tree swallow migration through its (unobserved) effect on the abundance of flying insects that make up the greatest part of their food source.

Modeling migration effects is made challenging by complex spatio-temporal effects along with high-order interactions among other landscape and land-use features (Zuckerberg et al., 2016). This makes machine learning models particularly useful, and these have been extensively explored, for example in Fink et al. (2020). In this project we had a data set of 25,727 observations collected after day 200 in each year from 2008 through 2013 in six wildlife refuges along the northeastern United States coast. These observations recorded the presence or absence of tree swallow sightings, the time and day of the recording, and the time and distance effort put into data collection. We also collected local landcover and elevation summaries from remote sensing systems as well as the daily maximum temperature, providing a total of 30 variables for each observation.

Within this data set, daily maximum temperature is so strongly associated with day of year that there is little evidence of its predictive utility. This relationship also makes permuting the maximum temperature values problematic, as it creates unrealistically warm December days and very chilly Augusts. Instead, we replace maximum temperature with its difference from the average for that day and location over the six-year period. This orthogonalization isolates the effect of weather from other temporal drivers, both allowing a machine learning algorithm to make use of the signal and reducing the effect of permuting the difference.

We assessed whether maximum temperature had a discernible effect on tree swallow abundance by examining the differences in predictions between tree ensembles obtained from the original data and those obtained using a permuted version of maximum temperature difference.
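The replacement of raw temperature with its local deviation can be sketched with a groupby-transform. The data frame and column names below (location, day_of_year, max_temp) are hypothetical stand-ins for the actual study variables, not the real data.

```python
# Sketch of the anomaly ("difference from average") construction described
# above: replace raw daily maximum temperature with its deviation from the
# multi-year average for that day of year and location. Column names are
# hypothetical stand-ins for the actual study variables.
import pandas as pd

def add_temperature_anomaly(df: pd.DataFrame) -> pd.DataFrame:
    """Add max_temp's deviation from its day-of-year/location mean."""
    out = df.copy()
    baseline = out.groupby(["location", "day_of_year"])["max_temp"].transform("mean")
    out["max_temp_anom"] = out["max_temp"] - baseline
    return out

df = pd.DataFrame({
    "location":    ["A", "A", "A", "B", "B", "B"],
    "day_of_year": [201, 201, 202, 201, 201, 202],
    "year":        [2008, 2009, 2008, 2008, 2009, 2008],
    "max_temp":    [30.0, 32.0, 25.0, 20.0, 22.0, 18.0],
})
df = add_temperature_anomaly(df)
# Permuting max_temp_anom no longer creates impossible December/August
# temperatures, because the seasonal level has been removed.
print(df[["location", "day_of_year", "max_temp_anom"]])
```

The anomaly column is centered at zero within each day-location cell, so permuting it shuffles only weather deviations while leaving the seasonal pattern intact.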
We computed these prediction differences at 25 time-location pairs in each of the wildlife refuges and used a multivariate version of (1) to develop a χ² test for each 25-vector of prediction differences. This revealed strong evidence of an effect at four of the regions, with much attenuated evidence in the most northerly and southerly regions, where we expect migration timing and climate to already reduce the role of temperature.

This study illustrates the combining of the modeling cultures that Breiman describes: statistical considerations play a key role in designing the covariates we used, the development of our tests, and our assessment of evidence, but we avoided the need for extensive and complex models by using algorithmic tools: treating these as statistical models is not a large leap. Debates around causality and fairness in machine learning have equally demonstrated the need to combine schools in this way, and we expect the gap that Breiman perceived to become increasingly invisible.

“This is a fascinating enterprise, and I doubt if data models are applicable. Yet I would enter this in my ledger as a statistical problem.” – Page 214

Breiman’s essay clearly displays his frustration with the conservatism of a statistics community unwilling to engage with unfamiliar ideas and methods. This same sentiment is not unfamiliar to us, but it is also hardly unique to statistics, as many statisticians who have received reviews from the computer science conferences promoted by Breiman can attest. We do not subscribe to the sentiment that “Statistics lost Machine Learning”, but the statement does lead us to ask what other areas we might currently be overlooking.

An obvious answer is “deep learning” (viewed at the time of Breiman’s essay as being somewhat played out), and indeed we are aware of strikingly little statistical involvement in this intensely hot area of machine learning research, where the greatest performance gains over the past decade – Breiman’s primary metric of validity – have come from.
The work that we are aware of (e.g. Barron and Klusowski, 2018; Bai et al., 2020) focuses on shallow networks and largely ignores the algorithms used to learn weights. This seems analogous to the original work on random forests, and we hope that it will similarly lead to steadily more practical results, but we also expect progress will require new ways of representing the problem.

However, rather than the mathematical properties of neural networks, the aspect we most worry about statisticians ignoring is the type of data where these tools excel: images, text, sound, and other natural data types that are both highly structured – e.g. pixels are next to other pixels – and where individual numbers have little to no individual interpretive value. (The large statistical literature on medical imaging requires making images comparable on a value-by-value or at least region-by-region level. Similar comments can be made about inference with many other non-Euclidean objects, where defining a distance appears to be a starting point.) We have seen little statistical interest in the way deep learning is applied to these types of complex data objects, where questions about manifold structure, the stability of explanation methods, or structuring models that extract information could all benefit from statistical perspectives, even as we freely acknowledge that we, ourselves, have not contributed.

Outside of these specifics, we expect that Breiman would have welcomed the broadening of interests associated with the field of Data Science. Statistics has a long history of worrying about the design and causal issues in data provenance and the biases associated with data collection, and a somewhat more limited engagement with data privacy. Much more of this is sorely needed, as the emerging issues in ethical AI are making clear. We think the adjacent areas of data merging and data cleaning are both ripe for more formal study, developing methods and analyzing their consequences.
There are surely other tasks of which we haven’t thought, but we really do need to encourage the discipline to expand beyond our collective intellectual comfort zone.

“The roots of statistics, as in science, lie in working with data and checking theory against data. I hope in this century our field will return to its roots. There are signs that this hope is not illusory. Over the last ten years, there has been a noticeable move toward statistical work on real world problems and reaching out by statisticians toward collaborative work with other disciplines. I believe this trend will continue and, in fact, has to continue if we are to survive as an energetic and creative field.” – Page 214

Looking around today at the emerging field of data science and the resulting shifts within the statistics community, it’s easy to see the impact of Breiman’s thinking. As faculty members in statistics departments who are active in data science education at both the undergraduate and graduate levels, we frequently interact with students who are new to the field, many of whom are reading Breiman’s seminal essay for the first time. In the vast majority of cases, we find that these young statistics students are puzzled as to why it was ever considered a controversial work. “
Isn’t it obvious that this is what we’re supposed to be doing?”
Ironically, perhaps the best indication of Breiman’s lasting impact through his “Two Cultures” essay is its lessened relevance today.

Just as the field of statistics has begrudgingly opened its mind to problems outside the classical modeling framework, computer scientists and machine learning researchers deserve credit for expanding their horizons as well. A newfound interest in interpretability (e.g. “explainable AI”), causality, and fairness is commonly on display at each of the most notable machine learning conferences, each of which hearkens back to fundamental ideas and philosophical considerations in traditional statistical modeling.

It’s worth noting that while the boundaries of these fields may have softened, it remains the case that their respective directions are not unified. Computer scientists remain largely driven by the “what” (predictive accuracy); statisticians by the “how” and “why.” We still observe a separation of concerns: global interpretations versus local explanations mirrors scientific knowledge versus actions. In our view, however, so long as we resist dogmatic aversions to alternative ways of thinking, unity shouldn’t be seen as a necessity for (or perhaps even the optimal route to) continued progress.

Finally, there are likely some statisticians who, like David Cox in his commentary on the original paper, remain skeptical of the recent trends in our field towards data science, fearing that they signify a surrender of the theoretical school of statistics and traditional modeling. We do not think so; rather, it is the realization that both sides might have something to tell the other. Rather than seeing these trends as a shift away from classical statistical ideas, we urge skeptical colleagues to consider these recent developments an expansion of those ideas. This, we firmly believe, is the central theme of Breiman’s essay and is ultimately what allowed for the exciting recent developments detailed in the previous sections.
References
Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. Annals of Statistics 47(2), 1148–1178.
Bai, J., Q. Song, and G. Cheng (2020). Efficient variational inference for sparse deep learning with theoretical guarantee. arXiv preprint arXiv:2011.07439.
Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. Annals of Statistics 43(5), 2055–2085.
Barron, A. R. and J. M. Klusowski (2018). Approximation and estimation for high-dimensional deep learning networks. arXiv preprint arXiv:1809.03090.
Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research 13, 1063–1095.
Biau, G. and L. Devroye (2010). On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. Journal of Multivariate Analysis 101(10), 2499–2518.
Biau, G., L. Devroye, and G. Lugosi (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research 9, 2015–2033.
Biau, G. and E. Scornet (2016). A random forest guided tour. Test 25(2), 197–227.
Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
Breiman, L. et al. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science 16(3), 199–231.
Breiman, L., J. Friedman, C. J. Stone, and R. Olshen (1984). Classification and Regression Trees (1st ed.). Belmont, CA: Wadsworth.
Bühlmann, P. L. (2002). Consistency for L2 boosting and matching pursuit with trees and tree-type basis functions. Research Report 109, Seminar für Statistik, Eidgenössische Technische Hochschule (ETH).
Candès, E., Y. Fan, L. Janson, and J. Lv (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 551–577.
Chipman, H. A., E. I. George, and R. E. McCulloch (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266–298.
Coleman, T., L. Mentch, D. Fink, F. A. La Sorte, D. W. Winkler, G. Hooker, and W. M. Hochachka (2020). Statistical inference on tree swallow migrations with random forests. Journal of the Royal Statistical Society: Series C (Applied Statistics) 69(4), 973–989.
Coleman, T., W. Peng, and L. Mentch (2019). Scalable and efficient hypothesis testing with random forests. arXiv preprint arXiv:1904.07830.
Davis, R. B. and J. R. Anderson (1989). Exponential survival trees. Statistics in Medicine 8(8), 947–961.
Díaz-Uriarte, R. and S. A. De Andres (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7(1), 1–13.
Dua, D. and C. Graff (2017). UCI Machine Learning Repository.
Eubank, R. L. (1999). Nonparametric Regression and Spline Smoothing. CRC Press.
Fernández-Delgado, M., E. Cernadas, S. Barro, and D. Amorim (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15(1), 3133–3181.
Fink, D., T. Auer, A. Johnston, V. Ruiz-Gutierrez, W. M. Hochachka, and S. Kelling (2020). Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications 30(3), e02056.
Freund, Y. and R. E. Schapire (1996). Experiments with a new boosting algorithm. In ICML, Volume 96, pp. 148–156.
Freund, Y. and R. E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics 29(5), 1189–1232.
Gordon, L. and R. A. Olshen (1985). Tree-structured survival analysis. Cancer Treatment Reports 69(10), 1065–1069.
Hill, J. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 217–240.
Hooker, G. (2007). Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics 16(3), 709–732.
Hooker, G. and L. Mentch (2019). Please stop permuting features: An explanation and alternatives. arXiv preprint arXiv:1905.03151.
Hothorn, T., P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner (2010). Model-based boosting 2.0. Journal of Machine Learning Research 11, 2109–2113.
Hothorn, T., B. Lausen, A. Benner, and M. Radespiel-Tröger (2004). Bagging survival trees. Statistics in Medicine 23(1), 77–91.
Lei, J., M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association 113(523), 1094–1111.
Li, X., Y. Wang, S. Basu, K. Kumbier, and B. Yu (2019). A debiased MDI feature importance measure for random forests. Advances in Neural Information Processing Systems 32, 8049–8059.
Liu, M. and L. Janson (2020). Fast and powerful conditional randomization testing via distillation. arXiv preprint arXiv:2006.03980.
Loecher, M. (2020). Unbiased variable importance for random forests. Communications in Statistics – Theory and Methods, 1–13.
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 361–386.
Lou, Y., R. Caruana, J. Gehrke, and G. Hooker (2013). Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 623–631. ACM.
Lu, B. and J. Hardin (2019). A unified framework for random forest prediction error estimation. arXiv preprint arXiv:1912.07435.
Lundberg, S. M. and S.-I. Lee (2017). A unified approach to interpreting model predictions. In NIPS.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7, 983–999.
Mentch, L. and G. Hooker (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research 17(1), 841–881.
Mentch, L. and G. Hooker (2017). Formal hypothesis tests for additive structure in random forests. Journal of Computational and Graphical Statistics 26(3), 589–597.
Mentch, L. and S. Zhou (2020a). Getting better from worse: Augmented bagging and a cautionary tale of variable importance. arXiv preprint arXiv:2003.03629.
Mentch, L. and S. Zhou (2020b). Randomization as regularization: A degrees of freedom explanation for random forest success. Journal of Machine Learning Research 21(171), 1–36.
Peng, W., T. Coleman, and L. Mentch (2019). Rates of convergence for random forests via generalized U-statistics. arXiv preprint arXiv:1905.10651.
Ročková, V. and E. Saha (2019). On theory for BART. In Proceedings of Machine Learning Research, Volume 89, pp. 2839–2848.
Rudin, C. (2018). Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154.
Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory 62(3), 1485–1500.
Scornet, E. (2020). Trees, forests, and impurity-based variable importance. arXiv preprint arXiv:2001.04295.
Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests. The Annals of Statistics 43(4), 1716–1741.
Segal, M. and Y. Xiao (2011). Multivariate random forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1), 80–87.
Segal, M. R. (1988). Regression trees for censored data. Biometrics, 35–47.
Srivastava, R., P. Li, and D. Ruppert (2016). RAPTT: An exact two-sample test in high dimensions using random projections. Journal of Computational and Graphical Statistics 25(3), 954–970.
Steingrimsson, J. A., L. Diao, A. M. Molinaro, and R. L. Strawderman (2016). Doubly robust survival trees. Statistics in Medicine 35(20), 3595–3612.
Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics 9(1), 307.
Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1), 25.
Tan, S., M. Soloviev, G. Hooker, and M. T. Wells (2020). Tree space prototypes: Another look at making tree ensembles interpretable. In Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference, pp. 23–34.
Tansey, W., V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei (2018). The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645.
Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242.
Williamson, B. D., P. B. Gilbert, N. R. Simon, and M. Carone (2020). A unified approach for inference on algorithm-agnostic variable importance. arXiv preprint arXiv:2004.03683.
Wyner, A. J., M. Olson, J. Bleich, and D. Mease (2017). Explaining the success of AdaBoost and random forests as interpolating classifiers. Journal of Machine Learning Research 18(1), 1558–1590.
Yu, B. (2014). IMS Presidential Address: Let us own data science. IMS Annual Meeting, Sydney, Australia, July 9–14.
Zhang, H., J. Zimmerman, D. Nettleton, and D. J. Nordman (2019). Random forest prediction intervals. The American Statistician.
Zhou, Y. and G. Hooker (2018). Boulevard: Regularized stochastic gradient boosted trees and their limiting distribution. arXiv preprint arXiv:1806.09762.
Zhou, Z. and G. Hooker (2019). Unbiased measurement of feature importance in tree-based methods. arXiv preprint arXiv:1903.05179.
Zhou, Z., L. Mentch, and G. Hooker (2019). V-statistics and variance estimation. arXiv preprint arXiv:1912.01089.
Zhu, R. and M. R. Kosorok (2012). Recursively imputed survival trees. Journal of the American Statistical Association 107(497), 331–340.
Zuckerberg, B., D. Fink, F. A. La Sorte, W. M. Hochachka, and S. Kelling (2016). Novel seasonal land cover associations for eastern North American forest birds identified through dynamic species distribution modelling.