(f)RFCDE: Random Forests for Conditional Density Estimation and Functional Data
Taylor Pospisil and Ann B. Lee
Department of Statistics & Data Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
Abstract
Random forests are a common nonparametric regression technique that performs well for mixed-type unordered data and irrelevant features, while being robust to monotonic variable transformations. Standard random forests, however, do not efficiently handle functional data and run into a curse of dimensionality when presented with high-resolution curves and surfaces. Furthermore, in settings with heteroskedasticity or multimodality, a regression point estimate with standard errors does not fully capture the uncertainty in our predictions. A more informative quantity is the conditional density $p(y \mid x)$, which describes the full extent of the uncertainty in the response $y$ given covariates $x$. In this paper we show how random forests can be efficiently leveraged for conditional density estimation, functional covariates, and multiple responses without increasing computational complexity. We provide open-source software for all procedures, with R and Python versions that call a common C++ library.

1 Introduction

Conditional density estimation (CDE) is the estimation of the density $p(y \mid x)$, where we condition the response $y$ on observed covariates $x$. In a prediction context, CDE provides a more nuanced accounting of uncertainty than a point estimate or prediction interval, especially in the presence of heteroskedastic or multimodal responses. Conditional density estimation has proven useful in a range of applications; e.g., econometrics [Li and Racine, 2007], astronomy [Group], and likelihood-free inference [Izbicki et al., 2018].

We extend random forests [Breiman, 2001] to the more challenging problem of CDE, while inheriting the benefits of random forests with respect to interpretability, mixed data types, feature selection, and data transformations. We take advantage of the fact that random forests can be viewed as a form of adaptive nearest-neighbor method, with the aggregated tree structures determining a weighting scheme. This weighting scheme can then be used to estimate quantities other than the conditional mean; in this case, the entire conditional density. As shown in Section 3.4, random forests can also be adapted to take the functional nature of curves into account by splitting in the domain of the curves.

Other existing random forest implementations such as quantregForest [Meinshausen, 2006] and trtf [Hothorn and Zeileis, 2017] can be used for CDE. However, the first procedure does not alter the splits of the tree, and the second implementation does not scale to large sample sizes. Neither method handles functional covariates or multiple responses. In Section 3 we show that our method achieves lower CDE loss in competitive computational time for several examples.

In brief, the main contributions of this paper are:

1. Our trees are optimized to minimize a conditional density estimation loss while remaining computationally efficient, hence overcoming the limitations of the usual regression approach due to heteroskedasticity and multimodality.

2. We provide new procedures and public software (R and Python wrappers to a common C++ library) for joint conditional density estimation and functional data; this opens the door for interpretable models and uncertainty quantification for a range of modern applications involving different types of complex data from multiple sources.

2 RFCDE

To construct our base conditional density estimator we follow the usual random forest construction, with a key modification to the loss function, while retaining algorithms with linear complexity.
Here we describe our base algorithm; the extension to functional covariates is described in Section 3.4.

At their simplest, random forests are ensembles of regression trees. Each tree is trained on a bootstrapped sample of the data. The training process involves recursively partitioning the feature space through splitting rules of the form $\{X_i \le v\}$ and $\{X_i > v\}$ for a particular feature $X_i$ and split point $v$. Once a partition becomes small enough (controlled by a tuning parameter), it becomes a leaf node and is no longer partitioned.

For prediction, we use the tree structure to calculate weights for the training data, from which we perform a weighted kernel density estimate using "nearby" points. This is analogous to the regression case, which would perform a weighted mean.

Borrowing the notation of Breiman [2001] and Meinshausen [2006], let $\theta_t$ denote the tree structure for a single tree. Let $R(x, \theta_t)$ denote the region of feature space covered by the leaf node for input $x$. Then for a new observation $x^*$ we use the $t$-th tree to calculate weights for each training point $x_i$ as

\[ w_i(x^*, \theta_t) = \frac{\mathbb{1}[X_i \in R(x^*, \theta_t)]}{\sum_{j=1}^{n} \mathbb{1}[X_j \in R(x^*, \theta_t)]}. \]

We then aggregate over trees, setting $w_i(x^*) = T^{-1} \sum_{t=1}^{T} w_i(x^*, \theta_t)$. The weights are finally used for the weighted kernel density estimate

\[ \widehat{p}(y \mid x^*) = \frac{1}{\sum_{i=1}^{n} w_i(x^*)} \sum_{i=1}^{n} w_i(x^*) \, K_h(Y_i - y) \tag{1} \]

where $K_h$ is a kernel function integrating to one. The bandwidth can be selected using plug-in methods or through tuning based upon a validation set. Up to this point we have the same approach as Meinshausen [2006].

Our departure from the standard random forest algorithm is the criterion for choosing the splits of the partitioning. In regression contexts, the splitting variable and split point are often chosen to minimize the mean-squared-error loss. For CDE, we instead choose splits that minimize a loss specific to CDE [Izbicki and Lee, 2017]:

\[ L(p, \widehat{p}) = \int \int \left( p(y \mid x) - \widehat{p}(y \mid x) \right)^2 \, \mathrm{d}y \, \mathrm{d}P(x), \]

where $P(x)$ is the marginal distribution of $X$. This loss is the $L^2$ error for density estimation weighted by the marginal density of the covariates. To conveniently estimate this loss we can expand the square and rewrite the loss as

\[ L(p, \widehat{p}) = \mathbb{E}_X\left[ \int \widehat{p}(y \mid X)^2 \, \mathrm{d}y \right] - 2\, \mathbb{E}_{X,Y}\left[ \widehat{p}(Y \mid X) \right] + C_p \tag{2} \]

with $C_p$ a constant that does not depend on $\widehat{p}$. The first expectation is with respect to the marginal distribution of $X$ and the second with respect to the joint distribution of $X$ and $Y$. We estimate these expectations by their empirical expectations on observed data.
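To make the construction concrete, below is a minimal numpy sketch of the forest weights, the weighted kernel density estimate of Equation (1), and the empirical loss estimate of Equation (2). It assumes leaf assignments from an already-trained forest; the function names and the grid-based evaluation are illustrative choices, not the internals of our C++ library.

```python
import numpy as np

def forest_weights(train_leaves, test_leaves):
    """Aggregate per-tree co-membership weights w_i(x*) over T trees.

    train_leaves: (n, T) array, leaf index of each training point in each tree.
    test_leaves:  (T,)  array, leaf index of the query point x* in each tree.
    """
    n, T = train_leaves.shape
    weights = np.zeros(n)
    for t in range(T):
        in_leaf = train_leaves[:, t] == test_leaves[t]  # 1[X_i in R(x*, theta_t)]
        weights += in_leaf / in_leaf.sum()              # normalize within the tree
    return weights / T                                  # average over trees

def weighted_kde(y_train, weights, y_grid, h):
    """Weighted Gaussian KDE of Equation (1), evaluated on y_grid."""
    K = np.exp(-0.5 * ((y_grid[:, None] - y_train[None, :]) / h) ** 2)
    K /= h * np.sqrt(2.0 * np.pi)                       # kernel integrates to one
    return K @ weights / weights.sum()

def empirical_cde_loss(densities, y_grid, y_test):
    """Empirical estimate of Equation (2), up to the constant C_p.

    densities: (m, G) array, estimated p(y | x_j) on y_grid for m test points.
    """
    dy = y_grid[1] - y_grid[0]
    sq_term = np.mean(np.sum(densities**2, axis=1) * dy)        # E_X[int p_hat^2 dy]
    idx = np.clip(np.searchsorted(y_grid, y_test), 0, len(y_grid) - 1)
    lik_term = np.mean(densities[np.arange(len(y_test)), idx])  # E_{X,Y}[p_hat(Y|X)]
    return sq_term - 2.0 * lik_term
```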
To provide intuition for why switching the loss function is desirable, consider the example in Figure 1: we have two stationary distributions separated by a transition point at $x = 0.5$; the left is a normal distribution centered at zero and the right is a mixture of two normal distributions centered at 1 and -1. The clear split for a tree is the transition point; however, because the generating distribution's conditional mean is constant for all $x$, the mean-squared-error loss fits to noise and picks a poor split point far from $x = 0.5$. The CDE loss, on the other hand, is minimized at the split point.
5. The CDE loss on the other hand isminimized at the split point. l ll lll ll l ll l ll l l ll l l ll lll ll ll ll ll ll ll l lll ll ll lll llll l ll lll l l l lll ll l llll l l ll ll l ll ll l l lllll l ll l llll l lllll l l lllll l lll lll l l ll l ll ll l ll l ll l lll ll llll l ll ll lll ll ll l lll l l ll ll ll lll l lll lll ll ll llll ll lll ll lllll ll ll ll lll l llll l lll l lll l lll l ll llll l ll ll lll llll ll l llll l lll l ll ll l l l ll l lll ll lll lll lll l ll ll l lll ll ll lll lll llll lll l l lll lll lll l ll l l llll l l ll ll l ll l l ll ll ll ll llll ll l ll ll l ll l lll l ll lll l llll l ll l llll lllll l ll lll ll ll ll lllll l ll ll lll ll ll lll l lll l l llll ll lll lll ll lll l l ll ll ll lll ll lllll l llll ll l lll l lll l lll l ll ll l ll ll llll ll ll ll llll l l l ll lll lll ll l lll l l l l lll l ll lll ll lllll lll l lll l lll ll ll llll ll ll llll l l ll l lll ll l lll lll llll l lll ll ll lll lll l ll ll ll ll l ll l ll l ll l ll l ll ll ll l lllll l l ll l llll l ll lll lll lllllll ll l ll ll lll ll l ll ll l l l l l l ll ll lll ll ll lll ll lll lll l lll l l l lll l lllll ll ll ll ll ll ll l lll ll ll l lll llll l llll lll l ll ll llll ll ll l lll ll ll lll ll lll ll l lll ll ll ll ll ll lll lll ll l ll ll l ll lll ll llll lll ll l ll l l ll l ll l ll lll l l ll l ll l ll l ll l llll l ll l ll l ll ll ll l ll lll l ll ll lll l lll lll llll ll l lll ll ll llll llll l lll l ll ll l l ll llll ll ll ll l lll lll l ll ll ll ll ll ll ll ll l ll l l lll l l l ll ll l lll ll l lll l ll l ll ll
Figure 1: (top) Training data with a switch from a unimodal to a bimodal response at x = 0.5; (bottom) the normalized losses (CDE and MSE) for each cut-point. MSE overfits to small differences in the bimodal regime while the CDE loss is minimized at the true transition point.
While we use kernel density estimates for predictions on new observations, we do not use kernel density estimates when evaluating splits: the calculations in Equation 2 would be expensive for KDE, with the term $\int \widehat{p}(y \mid x)^2 \, \mathrm{d}y$ depending on the $O(n^2)$ pairwise distances between all training points.

For fast computations, we instead use orthogonal series to compute density estimates for splitting. Given an orthogonal basis $\{\phi_j(y)\}$, such as a cosine basis or wavelet basis, we can express the density as $\widehat{p}(y \mid x) = \sum_j \widehat{\beta}_j \phi_j(y)$ where $\widehat{\beta}_j = n^{-1} \sum_{i=1}^{n} \phi_j(y_i)$. This choice is motivated by a convenient formula for the CDE loss associated with an orthogonal series density estimate:

\[ \widehat{L}(p, \widehat{p}) - C_p = -\sum_j \widehat{\beta}_j^{\,2}. \]

The above expression depends only upon the quantities $\{\widehat{\beta}_j\}$, which themselves depend only upon linear sums of $\phi_j(y_i)$. This makes it computationally efficient to evaluate the CDE loss for each split.
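As an illustration of this split criterion, the following self-contained sketch scores candidate splits with a cosine basis, assuming (for the example only) that responses have been rescaled to [0, 1]. On data drawn from the Figure 1 setup, the best-scoring split should land near the true transition point. This is a demonstration of the formula, not our implementation.

```python
import numpy as np

def cosine_basis(y, n_basis):
    """phi_0(y) = 1, phi_j(y) = sqrt(2) * cos(j * pi * y) for y in [0, 1]."""
    j = np.arange(n_basis)
    return np.where(j == 0, 1.0, np.sqrt(2.0) * np.cos(np.pi * j * y[:, None]))

def node_cde_loss(y, n_basis=15):
    """Orthogonal-series CDE loss of a node up to C_p: -sum_j beta_j^2,
    with beta_j the average of phi_j over the responses in the node."""
    beta = cosine_basis(y, n_basis).mean(axis=0)
    return -np.sum(beta**2)

def split_cde_loss(y_by_x, i, n_basis=15):
    """Loss of splitting the responses (ordered by the covariate) at index i,
    with each child's loss weighted by its share of the points."""
    n = len(y_by_x)
    return (i * node_cde_loss(y_by_x[:i], n_basis)
            + (n - i) * node_cde_loss(y_by_x[i:], n_basis)) / n

# Demo on the Figure 1 setup: unimodal responses for x < 0.5, bimodal after.
rng = np.random.default_rng(0)
n = 2000
x = np.sort(rng.uniform(size=n))
y = np.where(x < 0.5,
             rng.normal(0.0, 0.5, size=n),
             rng.normal(rng.choice([-1.0, 1.0], size=n), 0.5))
y01 = (y - y.min()) / (y.max() - y.min())        # rescale responses to [0, 1]
scores = [split_cde_loss(y01, i) for i in range(50, n - 50)]
print("best CDE split near x = 0.5:", x[50 + int(np.argmin(scores))])
```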
3 Experiments

3.1 Synthetic data

To illustrate the potential benefits of optimizing with respect to a CDE loss, we compare against an existing random forest quantile method that minimizes an MSE loss. More specifically, we adapt the quantregForest [Meinshausen, 2006] package in R to perform conditional density estimation according to Equation 1. This is then equivalent to our method except that the splits for quantregForest minimize mean-squared error rather than CDE loss, thereby providing us a means for specifically studying the effect of training random forests for CDE. We also compare against the trtf package [Hothorn and Zeileis, 2017], which trains forests for CDE using flexible parametric families.

We generate data from the following model:

\[ X, Z \sim \mathrm{Uniform}(0, 1), \qquad S \sim \mathrm{Multinomial}(2, \tfrac{1}{2}), \]
\[ Y \mid X, Z, S \sim \begin{cases} \mathrm{Normal}\left( \lfloor \textstyle\sum_i X_i \rfloor, \, \sigma^2 \right) & S = 1 \\ \mathrm{Normal}\left( -\lfloor \textstyle\sum_i X_i \rfloor, \, \sigma^2 \right) & S = 2 \end{cases} \]

where the $X_i$ are the relevant features, with the $Z_i$ serving as irrelevant features. $S$ is an unobserved feature which induces multimodality in the conditional densities.

Under the true model $\mathbb{E}[Y \mid X, Z] = 0$, so there is no partitioning scheme that can reduce the MSE loss (as the conditional mean is always zero). As such, the trained quantregForest trees behave similarly to nearest neighbors, as splits are effectively chosen at random. We evaluate the three methods on two criteria: training time and CDE loss on a validation set. All models are tuned with the same random forest parameters (mtry = 4, ntrees = 1000).
RFCDE has nbasis = 15. trtf is fit using a Bernstein basis of order 5.
RFCDE and quantregForest have bandwidth 0.2.
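For concreteness, a sketch of the data-generating process above is given below. The numbers of relevant and irrelevant features and the noise level sigma are illustrative assumptions.

```python
import numpy as np

def simulate_synthetic(n, p_rel=2, p_irr=8, sigma=0.25, seed=0):
    """Draw (features, responses) from the synthetic model above.

    The feature counts p_rel/p_irr and the noise level sigma are assumed
    values for illustration. The hidden sign S makes the conditional
    density bimodal while keeping E[Y | X, Z] = 0.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p_rel))        # relevant features
    Z = rng.uniform(size=(n, p_irr))        # irrelevant features
    S = rng.choice([1.0, -1.0], size=n)     # unobserved component, P = 1/2 each
    mu = np.floor(X.sum(axis=1))            # shared mode location
    Y = rng.normal(S * mu, sigma)           # S flips the sign of the mode
    return np.hstack([X, Z]), Y

features, responses = simulate_synthetic(10_000)
```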
Table 1 summarizes the results for three simulations of size 1,000, 10,000, and 100,000 training observations. We use 1,000 observations for the test set and calculate the CDE loss. We see that RFCDE performs substantially better on CDE loss. We also note that RFCDE has competitive training time, especially for larger data sets. trtf is only run for the smallest data set due to its high computational cost.
3.2 Photometric redshift estimation

We next illustrate our methods on applications in astronomy. In order to utilize the statistical information in images of galaxies, we need to estimate how far away they are from the Milky Way. A metric for this distance is a galaxy's redshift, which is a measure of how much the Universe has expanded since the galaxy emitted its light. For the vast majority of galaxies, estimates of redshift probability density functions are made on the basis of brightness measurements at five wavelengths. This is dubbed photometric redshift estimation, or photo-z. Due to degeneracies, the PDFs may exhibit, e.g., multimodality; thus photo-z presents a natural venue for showing the efficacy of RFCDE.
We perform a CDE method comparison similar to that for the synthetic example, using realistically simulated photo-z data from LSST DESC [Group]. We split 100,000 training observations into subsets of size 1,000, 10,000, and 100,000, and evaluate the CDE loss on 10,000 held-out observations. We compare only against quantregForest, dropping trtf again for computational reasons. Both models are tuned with the same random forest parameters (mtry = 3, ntrees = 1000).
RFCDE has nbasis = 31. Kernel density bandwidths are selected using plug-in estimates. The covariates are the magnitudes for the six LSST filter bands (ugrizy) together with the local differences (u - g, g - r, etc.) for a total of 11 covariates.

Table 2 summarizes the results: similarly to the synthetic example, we find that RFCDE achieves substantially better CDE loss with competitive computational time.
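The covariate construction is simple enough to state in a few lines; a sketch is below, assuming an (n, 6) array of ugrizy magnitudes as input. The function name is illustrative.

```python
import numpy as np

def photo_z_covariates(mags):
    """Build the 11 covariates used above from an (n, 6) array of ugrizy
    magnitudes: the six magnitudes plus the five adjacent-band local
    differences (u-g, g-r, r-i, i-z, z-y)."""
    colors = mags[:, :-1] - mags[:, 1:]   # local differences between bands
    return np.hstack([mags, colors])
```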
3.3 Joint conditional densities

RFCDE can also target joint conditional density estimation, which allows us to capture dependencies in multivariate responses. This feature is particularly useful for estimating Bayesian posterior distributions in, e.g., likelihood-free inference [Izbicki et al., 2018]. The splitting process extends straightforwardly to the multivariate case through the use of a tensor basis,

\[ \widehat{p}(y_1, \ldots, y_d \mid x) = \sum_{j_1, \ldots, j_d} \widehat{\beta}_{j_1 \cdots j_d} \, \phi_{j_1}(y_1) \cdots \phi_{j_d}(y_d), \]

which results in the same formula for the CDE loss, summed over the coefficients $\widehat{\beta}_{j_1 \cdots j_d}$ instead of $\widehat{\beta}_j$. The density estimation similarly extends through the use of multivariate kernel density estimation in Equation 1. Bandwidth selection can be treated as in the univariate case through either plug-in estimators or tuning. These extensions allow for multivariate CDE (with a current effective limit of three response variables due to the use of the tensor basis).
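A minimal sketch of the tensor-basis computation for a bivariate response is shown below, assuming (as in the earlier univariate sketch) a cosine basis and responses rescaled to the unit square; the same negative sum of squared coefficients serves as the split criterion.

```python
import numpy as np

def cosine_basis(y, n_basis):
    """phi_0(y) = 1, phi_j(y) = sqrt(2) * cos(j * pi * y) for y in [0, 1]."""
    j = np.arange(n_basis)
    return np.where(j == 0, 1.0, np.sqrt(2.0) * np.cos(np.pi * j * y[:, None]))

def tensor_coefficients(Y, n_basis=5):
    """Coefficients beta_{jk} = mean_i phi_j(Y_i1) * phi_k(Y_i2) of the
    tensor-product basis for a bivariate response rescaled to [0, 1]^2."""
    A = cosine_basis(Y[:, 0], n_basis)              # (n, n_basis)
    B = cosine_basis(Y[:, 1], n_basis)              # (n, n_basis)
    return np.einsum('ij,ik->jk', A, B) / len(Y)    # (n_basis, n_basis)

def joint_node_cde_loss(Y, n_basis=5):
    """Joint CDE loss of a node up to C_p: minus the sum of squared
    tensor coefficients, mirroring the univariate formula."""
    return -np.sum(tensor_coefficients(Y, n_basis) ** 2)
```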
We showcase this by applying RFCDE to a problem related to weak lensing, the slight perturbation of a galaxy's appearance due to the bending of light by intervening masses. Statistical estimates of how correlated the perturbations are as a function of angular distance allow one to place constraints on two parameters of the Big Bang model: $\Omega_M$, the proportion of the Universe's mass-energy comprised of mass, and $\sigma_8$, which relates to how galaxies have clustered as the Universe has evolved. Specifically, we are interested in inferring the joint distribution of $\Omega_M$ and $\sigma_8$ given observed weak lensing data.

This problem provides an illustrative case for the need to model joint conditional distributions flexibly. There is a degeneracy curve $\Omega_M^{\alpha} \sigma_8$ on which the data are indistinguishable. In Figure 2 we see that RFCDE can capture this curved ridge structure well, which leads to better parameter constraints. In the example we used the GalSim toolkit [Rowe et al., 2015] to simulate shear correlation functions under different parameter settings.

Table 1: Performance of RFCDE and competing methods on synthetic data; smaller CDE loss implies better estimates.

Method           N        CDE Loss (SE)    Train Time (seconds)   Predict Time (seconds)
RFCDE            1,000    -0.171 (0.004)   2.31                   1.29
quantregForest   1,000
trtf             1,000
RFCDE            10,000   -0.194 (0.003)
quantregForest   10,000
RFCDE            100,000  -0.227 (0.004)
quantregForest   100,000
Table 2: Performance of RFCDE and competing methods for the photo-z application; smaller CDE loss implies better estimates.

Method           N        CDE Loss (SE)    Train Time (seconds)   Predict Time (seconds)
RFCDE            1,000    -2.394 (0.033)   2.24                   14.53
quantregForest   1,000
RFCDE            10,000   -4.018 (0.054)
quantregForest   10,000
RFCDE            100,000  -5.356 (0.092)
quantregForest   100,000
Figure 2: The posterior distribution of the parameters $\Omega_M$ and $\sigma_8$ exhibits a ridge structure which RFCDE captures due to splitting on CDE loss for joint densities.
3.4 Functional covariates

Treating functional covariates (like correlation functions or images) as unordered multivariate vectors on a grid suffers from a curse of dimensionality: as the resolution of the grid becomes finer, the dimensionality of the data increases while adding little additional information, due to the high correlation between nearby grid points.

To adapt to this setting we follow the approach of Möller et al. [2016] by partitioning the functional covariates of each tree in their domain, and passing the mean values of the function for each interval of the partition as inputs to the tree.

The partitioning is governed by the parameter $\lambda$ of a Poisson process: starting at the first element of the function evaluation vector, we take the next Poisson($\lambda$) evaluations as one interval of the partition. We then repeat the procedure, starting at the end of that interval and continuing until we have partitioned the entire function domain into disjoint regions $\{(l_i, h_i)\}$. Once we have partitioned the function domain, we take the function mean value $\frac{1}{h_i - l_i} \int_{l_i}^{h_i} f(x) \, \mathrm{d}x$ within each region as the covariates for the random forest.
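A minimal sketch of this partitioning scheme is given below, assuming curves observed on a common grid; the function name and return values are illustrative.

```python
import numpy as np

def functional_partition_features(F, lam, rng):
    """Partition the grid of function evaluations into intervals of random
    Poisson(lam) width and return the per-interval means as tree inputs.

    F: (n, d) array; each row is one curve evaluated on a common grid.
    """
    d = F.shape[1]
    cuts = [0]
    while cuts[-1] < d:
        cuts.append(cuts[-1] + max(1, rng.poisson(lam)))  # next interval width
    cuts[-1] = d                                          # clip the final interval
    means = [F[:, lo:hi].mean(axis=1) for lo, hi in zip(cuts[:-1], cuts[1:])]
    return np.column_stack(means), cuts
```

Each tree draws its own random partition before training, so the ensemble averages over many different groupings of the curve.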
We illustrate our functional RFCDE method on spectroscopic data for 3,000 galaxies from the Sloan Digital Sky Survey. The functional covariates consist of the observed spectrum of each galaxy at 3,421 different wavelengths; the response is the redshift of the galaxy. These spectra can be viewed as a high-resolution version of the photometric data from Section 3.2.

To show the benefits of taking the functional structure of the data into account, we compare the results of a base version of RFCDE, as described in Section 2, with RFCDE for functional covariates. We set $\lambda = 50$. Both forests are otherwise trained identically: ntrees = 1000, nbasis = 31, and the bandwidths are chosen using plug-in estimators. We train on 2,000 examples and use the remaining 1,000 to evaluate the CDE loss.

Table 3: Performance of RFCDE for functional data. We see that we obtain both better CDE loss and computational time by taking advantage of the structure of the functional data.
Method       Train Time (sec)   CDE Loss (SE)
Functional                      (0.844)
Vector       21.31              -12.578 (0.418)

Table 3 shows the CDE loss for both the vector-based and functional-based RFCDE models on the SDSS data.
We obtain substantial gains with a functional approach, both in terms of CDE loss and computational time. The computational gains are attributed to requiring fewer searches for each split point, as the default value of mtry = sqrt(d) is reduced.
4 Conclusion

We adapt random forests to conditional density estimation through the introduction of an alternative loss function for selecting splits in the trees of the random forest. This loss function is sensitive to changes in the distribution of the responses, which losses based upon regression can miss. We exhibit improved performance and comparable computational speed for a variety of different data examples, including functional data and joint conditional distributions.
We provide a software package, RFCDE, for fitting this model, consisting of a C++ library with R and Python wrappers. These packages, along with documentation, are available at https://github.com/tpospisi/rfcde.
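A brief Python usage sketch is given below. The constructor and method names, and the parameters shown, are assumptions based on the paper's description of the wrappers; consult the repository documentation for the exact interface.

```python
import numpy as np
import rfcde  # Python wrapper from https://github.com/tpospisi/rfcde

# Hypothetical interface, assumed from the paper's description; check the
# package documentation for the exact constructor and method names.
x_train = np.random.uniform(size=(1000, 4))
y_train = np.random.normal(size=1000)

forest = rfcde.RFCDE(n_trees=1000, mtry=4, node_size=20, n_basis=15)
forest.train(x_train, y_train)

y_grid = np.linspace(-3, 3, 200)                 # evaluation grid for densities
cde = forest.predict(x_train[:5], y_grid, bandwidth=0.2)
```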
We are grateful to Rafael Izbicki and Peter Freeman for helpful discussions and comments on the paper. This work was partially supported by NSF DMS-1520786.
References
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

LSST-DESC Photometric Redshift Working Group. An assessment of photometric redshift PDF performance in the context of LSST. Under review.

T. Hothorn and A. Zeileis. Transformation forests. arXiv preprint arXiv:1701.02110, 2017.

R. Izbicki and A. B. Lee. Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11(2):2800–2831, 2017.

R. Izbicki, A. B. Lee, and T. Pospisil. ABC-CDE: Towards approximate Bayesian computation with complex high-dimensional data and limited simulations. arXiv preprint arXiv:1805.05480, 2018.

Q. Li and J. S. Racine. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007.

N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.

A. Möller, G. Tutz, and J. Gertheiss. Random forests for functional covariates. Journal of Chemometrics, 30(12):715–725, 2016.

B. Rowe, M. Jarvis, R. Mandelbaum, G. M. Bernstein, J. Bosch, M. Simet, J. E. Meyers, T. Kacprzak, R. Nakajima, J. Zuntz, et al. GalSim: The modular galaxy image simulation toolkit. Astronomy and Computing, 10:121–150, 2015.