Autocart -- spatially-aware regression trees for ecological and spatial modeling
Ethan Ancell, Brennan Bean
Utah State University
Many ecological and spatial processes are complex in nature and are not accurately modeled by linear models. Regression trees promise to handle the high-order interactions that are present in ecological and spatial datasets, but fail to produce physically realistic characterizations of the underlying landscape. The "autocart" (autocorrelated regression trees) R package extends the functionality of previously proposed spatial regression tree methods through a spatially aware splitting function and a novel adaptive inverse distance weighting method in each terminal node. The efficacy of these autocart models, including an autocart extension of random forest, is demonstrated on multiple datasets. This highlights the ability of autocart to model complex interactions between spatial variables while still providing physically realistic representations of the landscape.
Accurate characterizations of ecological variables have important implications for researchers and stakeholders in agriculture, watershed sciences, natural resources, and environmental sciences [1]. Periodic land surveys to get accurate measurements for ecological variables are expensive. For example, in 2018 the United States Department of Agriculture spent over $70 million on soil surveys [2].

Many of these ecological variables are highly interactive. As an example, soil moisture is affected by precipitation, soil composition, temperature, elevation, and the surrounding ecosystem, yet the individual relationships between each variable and soil moisture cannot be considered in isolation. The complexity of these high-order interactions makes it extremely difficult to accurately predict with traditional spatial interpolation models [3]. This is particularly true in Utah, where sharp changes in elevation create drastic changes in climate over short distances.

Traditional spatial methods such as kriging [4] create smooth maps of soil properties, but such approaches fail to adequately handle high-order interactions among explanatory variables. Such shortcomings emphasize the need to model interactions among the soil variables to mitigate this problem. The need for non-linear predictions of soil properties is well-documented [5]. Machine-learning techniques promise to remedy under-performing linear models, due to their flexibility in characterizing complex and high-order interactions [6]. There is substantial existing research applying machine-learning algorithms to spatial ecological data [7, 8, 9, 10, 11, 12]. Soil property mapping that utilizes machine-learning has also been extensively studied [13, 5, 14]. One particularly promising method is the regression tree [15], which models high-order interactions in a way that is easy to interpret without requiring the massive amounts of data that other machine-learning approaches demand. Unfortunately for spatial data, traditional tree-based algorithms such as regression trees have no way of accounting for the relationship between observations that is explained by their spatial distribution. Coordinate information such as longitude and latitude can be naively included as a predictor in a tree-based model, but this leads to physically unrealistic predictions, often with sharp and discrete jumps in predictions across the geographical region. An example of these visual artifacts can be seen in Figure 1, which shows an attempt to use the random forests algorithm to model soil moisture in Utah. This figure shows a sharp, discrete jump in the predicted soil moisture at approximately 41.5 degrees latitude.

Figure 1: Prediction map of soil moisture from a default implementation of Random Forests in Utah. Explicitly including longitude/latitude as predictors in the tree-based method yields a clear jump in predictions moving from north to south that is not physically realistic.

The visual artifacts in Figure 1 are a symptom of the overfitting that is present in machine-learning methods when coordinates are included as explicit predictors [16]. Entirely omitting coordinate information from the prediction process may remove the visual artifacts, but ignoring the spatial component can compromise accuracy. Coordinate information is especially useful when analyzing data exhibiting spatial autocorrelation, which occurs when observations are related to each other as a function of distance [17]. The discussion and definitions surrounding the concept of spatial autocorrelation can be argued to directly follow from Tobler's first law of geography, which states "everything is related to everything else, but near things are more related than distant things" [18].
Properly accounting for spatial autocorrelation in the modeling process is a powerful way to improve predictions in data that exhibit this property [19].

Another problem that exists when analyzing a spatial dataset that covers a large region is the inability to make the assumption that the distribution of the variable of interest remains consistent across the entire space [20, 21]. Regression trees as a means to decompose a global spatial process into smaller local regions have been studied, including the effort by [21], which discusses the use of hierarchical clustering trees for this purpose.

This paper seeks to extend a machine-learning algorithm to handle coordinate information in an appropriate way, avoiding the problems of overfitting and visual artifacts while fully economizing on the predictive power that resides in coordinate information. Initial attempts to create models of this type include regression tree extensions by [22, 23], and a random forest extension by [24]. In the ensuing sections, we propose a tree-based algorithm called "autocart" (autocorrelative regression trees) intended to decompose a global spatial process into local regions in a hierarchical and automatic way, building upon the work proposed by [22] and [23]. The terminal nodes of the tree can be used for the simulation of a local spatial process using inverse-distance weighting interpolation. The result is a predictive process that harnesses the predictive power of both interpolation and regression trees.
In the traditional regression tree algorithm, we create partition rules on a dataset to predict some response variable according to a set of splits on the predictor variables [15]. The regression tree algorithm falls under the paradigm of supervised learning, where we use labeled training data to form rules for the prediction of new unlabeled data.

Formally, we predict a class or response $Y$ from predictor variables $X_1, X_2, \ldots, X_p$ by growing a binary tree. We form a prediction with the grown tree by applying a test on one of the predictor variables at each internal node of the tree, and depending on the outcome of the test, we move to either the right or left child of the internal node and proceed to apply the next test to one of our predictor variables. The final prediction for an observation with predictors $X_1, X_2, \ldots, X_p$ is made upon arriving at a terminal node of the binary tree, using the average of the response variable of the training observations that were a part of the terminal node during training.

Figure 2: An example of a regression tree. The response variable for this tree is the percentage of water by volume found in a particular soil sample.

To make a prediction, a set of predictors $\{X_1, X_2, \ldots, X_p\}$ is passed into the tree. As an example, if we had predictors slope = 0.54 and silt = 32, then we would make a prediction by starting at the top node of the tree, going left because slope = 0.54 < 0.62, and then going right because silt < 42. We would arrive at the terminal node, and then make the prediction $\hat{y} = 6.7$. The 13% in the terminal node indicates that 13% of the training data for the building of the tree resides in that node.

While growing the tree, the algorithm searches for a decision rule of the form $X_j < x$ out of all possible decision rules, such that predictive accuracy is maximized on the training dataset. This partitioning is done recursively, such that the two child nodes created as part of a split form the basis for the next splitting rule. To maximize predictive accuracy on the training data at each step of the algorithm, the decision rule splits the data into the two halves that minimize the residual sum of squares (RSS) in the newly formed children nodes:

$RSS = \sum_i (y_i - \bar{y})^2$

where $y_i$ are the observations of the labeled response value at this node, and $\bar{y} = \frac{1}{n}\sum_i y_i$ is the average of the labeled response variables within the node.

The minimization of RSS described above can be most efficiently calculated by maximizing the variance between the nodes. This variance is calculated as

$SS_B = n_l(\bar{y}_l - \bar{y})^2 + n_r(\bar{y}_r - \bar{y})^2$

where $n_l$ and $n_r$ are the sample sizes in the "left" and "right" partitions respectively, $\bar{y}_l$ and $\bar{y}_r$ are the average values of the response in the left and right children, and $\bar{y}$ is the average of the response over all observations in the parent node where the split is being made.

$SS_B$ is compared to the total variance of the node if no split had been made:

$SS_T = \sum_i (y_i - \bar{y})^2$

The tree makes splits by maximizing the ratio between $SS_B$ and $SS_T$. This ratio is encapsulated in the so-called "objective function", the measure of goodness or utility of each potential split:

$g_{rss} = SS_B / SS_T$    (1)
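To make the split search concrete, the following minimal R sketch (our illustration, not the rpart or autocart source) searches one numeric predictor for the split maximizing $g_{rss} = SS_B / SS_T$:

# Exhaustive search over one numeric predictor x for the split that
# maximizes g_rss = SS_B / SS_T on the node's training data.
best_split_grss <- function(x, y) {
  ybar <- mean(y)
  ss_t <- sum((y - ybar)^2)               # total variation of the node
  best <- list(threshold = NA, g_rss = -Inf)
  for (thresh in sort(unique(x))[-1]) {   # candidate split points
    left  <- y[x < thresh]
    right <- y[x >= thresh]
    ss_b <- length(left)  * (mean(left)  - ybar)^2 +   # between-node
            length(right) * (mean(right) - ybar)^2     # variance SS_B
    if (ss_b / ss_t > best$g_rss) {
      best <- list(threshold = thresh, g_rss = ss_b / ss_t)
    }
  }
  best
}

In the full algorithm this search runs over every predictor at every node, recursing on the resulting children.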
Breiman's CART algorithm discussed in Section 3.1 is aspatial, meaning that it does not consider the geographic distribution of the measurements. As discussed in Section 2, coordinate information can be included as predictor variables in the model, but this often leads to physically unrealistic predictions, as seen in Figure 1. Including coordinate information as explicit predictors in machine-learning models for spatial data also leads to an overfitting of the training dataset [16].

Simply excluding coordinate information from the predictive process may reduce the overfitting of the training dataset, but this may curtail the predictive power that lies in coordinate information, especially in spatial datasets with spatial autocorrelation.

An appropriate handling of coordinate information is to prevent the tree from making splits based upon coordinate information, but allow the tree to track coordinate information for each sample to use as part of the predictive process. The following techniques assume that the coordinate vector $\mathbf{s} = (x, y)$ is available for all samples in the training dataset. This restructure frames the regression tree prediction process as being fueled by both the information encapsulated in the predictor variables $\{X_1, X_2, \ldots, X_p\}$ as well as the geographic location expressed as a coordinate vector $(x, y)$.

An extension to the regression tree algorithm was proposed by [22]. In this extension, the objective function in the regression tree algorithm is formed by a linear combination of the objective function $g_{rss}$ described in Equation 1 and another objective function $g_{ac}$ that optimizes for measures of spatial autocorrelation in the partitions. It is defined as

$g = (1 - \alpha) g_{rss} + \alpha g_{ac}, \quad \alpha \in [0, 1]$    (2)

where $\alpha$ is a user-set parameter that weights minimizing the RSS versus maximizing the autocorrelative statistic. Please note that all cited equations have been converted into a common notation for convenience.

A popular statistic intended to measure the presence of spatial autocorrelation in a group of observations is Global Moran's I (i.e., Moran's I) [25]. Moran's I requires a vector of the response variable $Y$ where we measure spatial autocorrelation. We also require a spatial weights matrix that reflects the intensity of the spatial relationship between all pairwise observations in the group. Moran's I of a response variable $Y$ for a group of observations $G$ is defined to be

$I_Y = \frac{n \sum_i \sum_j w_{ij} (y_i - \bar{y})(y_j - \bar{y})}{W \sum_i (y_i - \bar{y})^2}$    (3)

where $n$ is the total number of observations that are indexed by $i$ and $j$ in the group $G$; $w_{ij}$ is the spatial weights matrix entry that represents the weight between observations $i$ and $j$ ($w_{ij}$ is 0 when $i = j$), and $W = \sum_i \sum_j w_{ij}$. $y_i$ is the response variable of interest at observation $i$, and $\bar{y} = \frac{1}{n}\sum_i y_i$ is the mean of the response variable in the group of observations $G$.

Under the null hypothesis of no spatial autocorrelation, the expected value of Moran's I for the response $Y$ in group $G$ is $E(I_Y) = \frac{-1}{n-1}$. $I$ typically lies in the range $[-1, 1]$. Values of $I$ that are significantly above $E(I)$ indicate positive spatial autocorrelation, whereas values significantly below $E(I)$ indicate negative spatial autocorrelation [17].

A critical choice is the spatial weighting scheme to produce the spatial weights matrix entries $w_{ij}$ used in Moran's I. The following Gaussian similarity measure connecting observations $i$ and $j$ is used by [22]:

$w_{ij} = \begin{cases} e^{-\mathrm{dist}(i,j)^2 / b}, & \text{if } \mathrm{dist}(i,j) < b \\ 0, & \text{if } \mathrm{dist}(i,j) \geq b \end{cases}$    (4)

where $\mathrm{dist}(\cdot)$ is a distance metric between two observations $i$ and $j$, and $b$ is a spatial-bandwidth parameter that reflects the spatial distance at which no spatial influence is assumed between observations. Other methods for choosing $w_{ij}$ in the weights matrix for the calculation of Moran's I are viable, and the best scheme for picking $w_{ij}$ may be dataset dependent. For a typical use case, this Gaussian weighting scheme is sufficient.
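For illustration, a short R sketch (our own code) computing Moran's I from Equation 3 with the Gaussian weights of Equation 4; coords is an n x 2 coordinate matrix and b the bandwidth:

# Moran's I (Equation 3) using the Gaussian weights of Equation 4.
morans_i <- function(y, coords, b) {
  d <- as.matrix(dist(coords))     # pairwise distances between locations
  w <- exp(-d^2 / b)               # Gaussian similarity with bandwidth b
  w[d >= b] <- 0                   # no spatial influence beyond b
  diag(w) <- 0                     # w_ij = 0 when i = j
  z <- y - mean(y)
  (length(y) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}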
In order to simultaneously maximize $I_Y$ in both partitions of the data, a fair weighting of $I_Y$ in each half according to the number of observations is required. This can be expressed as

$\tilde{I}_Y = \frac{n_L \cdot I_{Y_L} + n_R \cdot I_{Y_R}}{n}$    (5)

where $n_L$ and $n_R$ represent the number of observations in arbitrary "left" and "right" partitions, $I_{Y_L}$ and $I_{Y_R}$ are the values of the Moran's I statistic from Equation 3 for the individual left and right partitions, and $n = n_L + n_R$ is the total number of observations in the node being split.

The nature of $\tilde{I}_Y \in [-1, 1]$ requires a re-scaling to $[0, 1]$ to ensure an even match against the residual sum of squares objective function $g_{rss}$. The autocorrelative objective function can be defined as

$g_{ac} = \frac{\tilde{I}_Y + 1}{2}$

which finds its way into the final weighted objective function from Equation 2: $g = (1 - \alpha) g_{rss} + \alpha g_{ac}$, $\alpha \in [0, 1]$. By weighting $g_{ac}$ higher with $\alpha$, the tree is more likely to choose splits that create partitions where the observations in the partition exhibit the property of spatial autocorrelation. This is in contrast to traditional regression trees, which simply seek to minimize the residual sum of squares. The motivation underlying this spatial adaptation is that the consideration of $g_{ac}$ at the expense of $g_{rss}$ is worth the investment: the creation of spatial partitions that reflect self-contained units exhibiting a spatial pattern. [22] showed that for some datasets, weighting $g_{ac}$ higher resulted in gains in predictive accuracy with cross-validation.

On the other hand, weighting $g_{ac}$ too high has its own problems. If $g_{rss}$ is not weighted enough, then the prediction that is formed by the mean of the response value in the terminal node, $\bar{y}_T$, is not representative of the region as a whole and leads to poor predictions. The best strategy is to pick the value of $\alpha$ such that an optimal balance is struck between creating partitions that exhibit spatial autocorrelation while still maintaining the power of $\bar{y}_T$ as a baseline prediction.

[23] builds upon the work in [22] by proposing a novel data mining framework for geophysical data called interpolative clustering, in which interpolation of the response variable of the training data can be used to supplement the prediction $\bar{y}_T$ of a regression tree.

As discussed in Section 2, the end goal of these modified regression trees is to decompose the global spatial landscape into focused sub-regions. [23] observed that the predictive power of the tree can be improved by simulating a local spatial process contained in a terminal node of the tree using an interpolative method such as inverse-distance weighting. Previously, we formed a prediction for a new observation by considering the arithmetic mean of the response variable at the terminal node of the tree, $\bar{y}_T$. Now, we supplement that prediction with an interpolative simulation of the spatial pattern of the training observations in the terminal node.

In more formal terms, to make a prediction we run the predictor variables $\{X_1, X_2, \ldots, X_p\}$ through the splitting rules of the tree as normal, and note the resulting terminal node $T$ that the prediction falls into. Next, we make the prediction using an interpolation method, using only the training data in the terminal node as the reference points for interpolation. We denote the coordinates of the new prediction with $\mathbf{s} = (x, y)$, whose final prediction is made with

$\hat{Y}(\mathbf{s}) = \begin{cases} \frac{\sum_{i=1}^{n_T} w_i(\mathbf{s}) y_{T_i}}{\sum_{i=1}^{n_T} w_i(\mathbf{s})}, & \text{if } \mathrm{dist}(\mathbf{s}, \mathbf{s}_{T_i}) \neq 0 \text{ for all } i \\ y_{T_i}, & \text{if } \mathrm{dist}(\mathbf{s}, \mathbf{s}_{T_i}) = 0 \text{ for some } i \end{cases}$    (6)

where $n_T$ is the total number of training observations in the terminal node $T$, $y_{T_i}$ are the labeled response values of the training data in $T$, $\mathbf{s}_{T_i} = (x_i, y_i)$ are the coordinate locations of the training data in $T$, and $\mathrm{dist}(\cdot)$ is some spatial distance metric. $w_i$ is the spatial weight that is assigned between the interpolated location $\mathbf{s}$ and the known response locations $\mathbf{s}_{T_i}$.
$w_i$ is calculated with inverse distance weighting:

$w_i(\mathbf{s}) = \frac{1}{\mathrm{dist}(\mathbf{s}, \mathbf{s}_{T_i})^p}$    (7)

where $\mathbf{s}$ is the interpolated point, $\mathbf{s}_{T_i}$ are the known points in $T$, $\mathrm{dist}(\cdot)$ is a distance metric from the known point to the interpolated point (this may be Euclidean distance in the case of projected data, or great circle distance in the case of latitude/longitude coordinates), and $p \in \mathbb{R}^+$ is the power parameter set by the user. A higher power parameter will heavily weight close observations in relation to further observations, compared to a lower power parameter where further observations have more influence in the final prediction $\hat{Y}(\mathbf{s})$. The optimal choice for $p$ is a function of the strength of the underlying spatial distribution in nature, but $p = 2$ is commonly used.

The interpolative step described in Equations 6 and 7 is most effective when used in combination with the objective function $g_{ac}$. Creating partitions and terminal nodes that exhibit spatial autocorrelation is beneficial for inverse-distance weighting interpolation, as its efficacy relies upon the assumption that the correlation between observations decreases as the distance between them increases.
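A minimal R sketch of the terminal-node interpolation of Equations 6 and 7 (illustrative names; node_coords and node_y are assumed to hold the training locations and responses in terminal node T):

# Inverse-distance-weighted prediction at location s, using only the
# training observations that fell into terminal node T (Equations 6 and 7).
idw_predict <- function(s, node_coords, node_y, p = 2) {
  # Euclidean distance from s to every training location in the node
  d <- sqrt((node_coords[, 1] - s[1])^2 + (node_coords[, 2] - s[2])^2)
  if (any(d == 0)) return(node_y[which(d == 0)[1]])  # exact location match
  w <- 1 / d^p                      # inverse distance weights (Equation 7)
  sum(w * node_y) / sum(w)          # weighted average (Equation 6)
}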
In this section we propose further extensions to the methodologies introduced in Section 3.2. We refer to the final tree algorithm as the "autocart" algorithm (autocorrelative regression trees), which is publicly available as an R statistical software package at the following URL: github.com/ethanancell/autocart

In Section 3.2, inverse distance weighting at the terminal nodes of the regression tree is discussed as a way to supplement the prediction of the tree. Here, we propose a novel approach to picking the power parameter $p$ in Equation 7 for the inverse-distance weights.

If we consider these spatial regression trees to do the part of automatically decomposing a global landscape into local areas where the spatial pattern may differ, then it is unfair to state that the optimal choice of the power parameter $p$ is constant across all terminal nodes representing local regions. A local region may exhibit a stronger dependence between closely neighboring observations than other local regions, necessitating a varying $p$.

In order to assess the strength of a spatial relationship in some terminal node $T$, we can reuse the Moran's I statistic from Equation 3. A terminal node exhibiting a stronger correlation between closely neighboring observations will have a comparatively higher value of $I_Y$. In a region where a stronger correlation is noted between closely neighboring observations, it is sensible to pick $p$ to be higher so that the weights defined in Equation 7 give greater weight to neighboring observations as compared to far observations. In a region where a weak correlation is noted between closely neighboring observations, it is sensible to pick $p$ to be lower so that the resulting prediction catches on to the trend of the region as a whole, rather than relying upon inappropriate confidence in the predictive power of close observations, indicated by the low value of $I_Y$.

Let $M$ represent the set of $I_{Y_i}$ calculated for each terminal node in the regression tree. In the case that $I_{Y_i} < E(I_{Y_i})$ for the terminal node, there is no value in interpolation, as the correlation between observations close in space is not significant. The ranged value of $p$ will be based only upon the values $I_{Y_i} \in M$ where spatial autocorrelation is observed. We denote this new set with

$\tilde{M} = \{ I_{Y_i} \in M : I_{Y_i} > E(I_{Y_i}) \}.$

To make a prediction for the coordinate $\mathbf{s} = (x, y)$, we first run the accompanying predictor variables $\{X_1, X_2, \ldots, X_p\}$ through the tree to find which terminal node $\mathbf{s}$ belongs to. Once we have found the correct terminal node $T$, we make the prediction in a similar way to Equation 6:

$\hat{Y}(\mathbf{s}) = \begin{cases} y_{T_i}, & \text{if } \mathrm{dist}(\mathbf{s}, \mathbf{s}_{T_i}) = 0 \text{ for some } i \\ \bar{y}_T, & \text{if } I_{Y_T} < E(I_{Y_T}) \text{ and } \mathrm{dist}(\mathbf{s}, \mathbf{s}_{T_i}) \neq 0 \text{ for all } i \\ \frac{\sum_{i=1}^{n_T} w_i(\mathbf{s}) y_{T_i}}{\sum_{i=1}^{n_T} w_i(\mathbf{s})}, & \text{if } I_{Y_T} > E(I_{Y_T}) \text{ and } \mathrm{dist}(\mathbf{s}, \mathbf{s}_{T_i}) \neq 0 \text{ for all } i \end{cases}$    (8)

where $I_{Y_T}$ represents the observed value of the Moran's I statistic for the training observations in the terminal node $T$ with respect to the response variable $Y$, and $E(I_{Y_T})$ is the expected value of $I_{Y_T}$. $\bar{y}_T$ denotes the observed mean of the response variable $Y$ of the training observations in $T$.

$w_i$ remains much the same as in Equation 7, the difference being we use the varying power parameter $p_T$:

$w_i(\mathbf{s}) = \frac{1}{\mathrm{dist}(\mathbf{s}, \mathbf{s}_{T_i})^{p_T}}.$    (9)

The varying power parameter $p_T$ is a monotonic increasing function that maps from $[\min(\tilde{M}), \max(\tilde{M})]$ to $[p_1, p_2]$, where $p_1$ and $p_2$ are user-set parameters that indicate the range of power parameters. The terminal node that exhibits the most significant value of $I_{Y_T}$ will use $p_2$ for its power parameter, and the terminal node with the least significant (yet above expected) value of $I_{Y_T}$ will use $p_1$.

One choice of $p_T : [\min(\tilde{M}), \max(\tilde{M})] \mapsto [p_1, p_2]$ is the following linear function:

$p_T = \frac{(I_{Y_T} - \min(\tilde{M}))(p_2 - p_1)}{\max(\tilde{M}) - \min(\tilde{M})} + p_1$    (10)

As $p_1$ and $p_2$ are set by the user, it is crucial that the user have a sense of an appropriate range of values for $p$ in the context of their particular dataset.
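The adaptive mapping of Equation 10 can be sketched in a few lines of R; the function and the default range used here are illustrative assumptions, not package defaults:

# Map each terminal node's Moran's I statistic to a power parameter in
# [p1, p2] via the linear function of Equation 10. Nodes whose I falls at
# or below the null expectation E(I) = -1/(n - 1) get NA, signaling that
# the prediction should fall back to the node mean (Equation 8).
adaptive_power <- function(I_nodes, n_nodes, p1 = 0.5, p2 = 3.0) {
  e_I <- -1 / (n_nodes - 1)             # per-node expected I under the null
  M_tilde <- I_nodes[I_nodes > e_I]     # nodes showing autocorrelation
  ifelse(I_nodes > e_I,
         (I_nodes - min(M_tilde)) * (p2 - p1) /
           (max(M_tilde) - min(M_tilde)) + p1,
         NA_real_)
}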
Section 3.2 described an objective function that encourages high values of spatial autocorrelation within the internal nodes of the tree. Section 3.2 also describes the use of interpolation at the terminal nodes of the tree to supplement the prediction of $\bar{y}_T$. In this section, we propose another possible objective function for the tree and explore the possibility of weighting this new objective function alongside the objective function $g_{ac}$ described in Section 3.2.

When using interpolation as part of the predictive process, it is desired that the terminal nodes of the tree create sub-regions of the data that are ideal for interpolation. Excessive overlap in the regions created by the terminal nodes is not ideal for interpolation, as much of the final prediction is weighted by distant observations while ignoring other observations that are geographically close but not in the same terminal node. In this section, another objective function $g_{sc}$ for the encouragement of spatially compact internal and terminal nodes is introduced.

At an arbitrary level of the splitting process, define the total sum of squared pairwise distances within the node $N$ to be

$TSS_D = \sum_{\mathbf{s}_i \in N} \sum_{\mathbf{s}_j \in N} \mathrm{dist}(\mathbf{s}_i, \mathbf{s}_j)^2.$

Consider an arbitrary partition of the data in the node $N$ that produces a "left" and "right" partition, subscripted by $l$ and $r$ respectively. Let $N_l$ be the set of all training observations in the left partition, and $N_r$ be the set of all training observations in the right partition. Define the between sum of squared differences for the partitions to be

$BSS_D = \sum_{\mathbf{s}_i \in N_l} \sum_{\mathbf{s}_j \in N_r} \mathrm{dist}(\mathbf{s}_i, \mathbf{s}_j)^2.$

As a sort of "spatial extension" to a one-way ANOVA, the total sum of squares of distances is composed of the sum of all between sums of squared differences and the sum of all within sums of squared differences. This is represented as

$TSS_D = BSS_D + WSS_D$    (11)

where

$WSS_D = \sum_{\mathbf{s}_i \in N_l} \sum_{\mathbf{s}_j \in N_l} \mathrm{dist}(\mathbf{s}_i, \mathbf{s}_j)^2 + \sum_{\mathbf{s}_i \in N_r} \sum_{\mathbf{s}_j \in N_r} \mathrm{dist}(\mathbf{s}_i, \mathbf{s}_j)^2.$

Minimizing $WSS_D$ encourages spatially compact regions resulting from the split, minimizing uncertainty in the interpolation step. Due to the identity in Equation 11, this is possible by maximizing $BSS_D$. The previous objective functions $g_{rss}$ and $g_{ac}$ from Sections 3.1 and 3.2 respectively indicate a more desirable split when they are higher in value and closer to 1. As $BSS_D / TSS_D \in [0, 1]$, we can define $g_{sc}$ as:

$g_{sc} = \frac{BSS_D}{TSS_D}.$    (12)

We can include $g_{sc}$ in the linear combination of previously discussed objective functions with the weighting parameter $\beta$:

$g = (1 - \alpha - \beta) g_{rss} + \alpha g_{ac} + \beta g_{sc}, \quad \text{where } \alpha, \beta \in [0, 1] \text{ and } \alpha + \beta \leq 1.$    (13)

Thus, the revised regression tree algorithm is to search through all predictor variables $X_1, X_2, \ldots, X_p$ looking for the splitting rule that maximizes the objective function $g$ of Equation 13 at each recursive partitioning of the data.
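For concreteness, a small R sketch (not the package internals) of the compactness objective in Equation 12, given coordinates and a logical indicator of the left partition:

# Spatial compactness objective g_sc = BSS_D / TSS_D (Equation 12).
# coords: n x 2 matrix of node locations; in_left: logical split indicator.
g_sc <- function(coords, in_left) {
  d2 <- as.matrix(dist(coords))^2    # squared pairwise distances
  sum(d2[in_left, !in_left]) / sum(d2)
}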
The revised objective function in Equation 13 and the interpolative process discussed in Section 4 are promising ways to improve upon the predictions of regression trees when applied to continuous spatial data. Random Forests are a powerful extension of classification and regression trees. Random Forests increase the predictive accuracy of classification and regression trees by minimizing the variance of predictions, producing a "forest" of trees, each trained using a bootstrapped sample of training data and a random subset of the predictor variables to split on at each node. The averaging of predictions that occurs in Random Forests greatly improves upon the predictive power of a single regression tree [26].

The creation of a "forest" of autocart trees is proposed and discussed in this section. Let us denote a single autocart tree as a function $f_A$, where a prediction is made by running a set of predictors $\{x_1, x_2, \ldots, x_p\}$ through the splitting rules in the tree (trained by maximizing the objective function $g$ at each recursive partition), and then assigning the final prediction either by the average of the response variable in the terminal node (previously referred to as $\bar{y}_T$), or by an interpolative rule $\hat{Y}(\mathbf{s})$ as in Equation 6 or Equation 8.

We create a forest of $k$ autocart trees by creating the set of trees $F = \{f_{A_1}, f_{A_2}, \ldots, f_{A_k}\}$. A prediction is made by running the set of predictors $\{x_1, x_2, \ldots, x_p\}$ through each tree, and then using the arithmetic mean of the predictions of all trees in $F$:

$\hat{Y} = \frac{1}{k} \sum_{i=1}^{k} f_{A_i}(\{x_1, x_2, \ldots, x_p\}).$

Each regression tree $f_{A_i}$ is trained with a subsample of the $n$ observations drawn without replacement. Note that this differs from standard Random Forests, where $n$ observations are sampled with replacement. In this spatial adaptation, repeat observations with identical coordinate information cause problems in the spatial weights matrix, as this results in an "infinite" weight.

As all $n$ records have an equal chance of being chosen, the bias of $f_{A_i}$ is not affected, especially across all $k$ trees. Additionally, in each tree only $m$ predictors are selected from $X_1, X_2, \ldots, X_p$ at each node. $m$ is a user-set parameter, but can be safely defaulted to a small fraction of $p$ (e.g., $\lceil p/3 \rceil$, the common regression forest default).

The autoforest extension to the autocart tree is a way to imbue the Random Forest algorithm with spatial data while refraining from explicitly including coordinates as a predictor in the splitting.

(Note: The software implementation of autoforest currently chooses $m$ predictor variables to split on at each tree rather than at each node. This was done for ease of implementation, but a future version of the R package will resolve this issue.)
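A schematic R sketch of the autoforest procedure; fit_autocart_tree is a stand-in for growing a single autocart tree as described above, and the two-thirds subsample fraction is an illustrative choice:

# Grow k autocart trees, each on a subsample drawn WITHOUT replacement
# (repeated coordinates would create an "infinite" spatial weight), then
# predict with the arithmetic mean over the ensemble.
fit_autoforest <- function(X, y, coords, fit_autocart_tree,
                           k = 100, frac = 2 / 3) {  # frac is illustrative
  trees <- lapply(seq_len(k), function(i) {
    idx <- sample(nrow(X), size = floor(frac * nrow(X)), replace = FALSE)
    fit_autocart_tree(X[idx, , drop = FALSE], y[idx],
                      coords[idx, , drop = FALSE])
  })
  # Each fitted tree is assumed to be a prediction function f(newX, newS).
  function(newX, newS) {
    preds <- sapply(trees, function(f) f(newX, newS))
    rowMeans(matrix(preds, nrow = nrow(newX)))
  }
}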
6 Results and comparison of model architectures

An optimal dataset for the autocart algorithm would include coordinate information for all training observations, and the prediction of a continuous response variable over the region represented by the dataset. Additionally, all predictor variables used to train the autocart tree must be available at all locations where predictions are desired, as techniques to infer the value of a missing predictor variable $X_i$ are not covered in this paper. If the autocart algorithm is to be used in a mapping setting, then gridded data across the entire mapped region is required.

The first dataset contains the average soil moisture level, recorded as the proportion of water per cubic centimeter of soil, for 195 remote sensing stations across the state of Utah. Gridded 30-year PRISM Climate Normals [27] are used, including the 30-year normals for maximum vapor pressure deficit, mean annual temperature, and mean annual precipitation. Additionally, a digital elevation map of Utah from the PRISM Climate Normals is used to obtain the elevation predictor and the derived slope and aspect predictors. The 30-year PRISM Climate Normals and derived data are selected for their gridded nature and possible environmental relation to soil moisture.
Variables and descriptions for the September 2017 soil moisture dataset:

sm8: The proportion of water per cm³ of soil (the response variable)
elev: The elevation of the location in meters
slope: A unitless "rise over run" measurement of the surface angle of the location, calculated from the "elev" model
aspect: The compass orientation of the slope at a point measured in degrees, where 0 and 360 degrees is north, 90 degrees is east, etc.
min_vpd and max_vpd: The 30-year estimates of minimum / maximum vapor pressure deficit measured in kilopascals (kPa)
min_temp, max_temp, and mean_temp: The 30-year estimates of minimum, maximum, and mean temperature measured in degrees Fahrenheit
mean_dewpoint: The 30-year estimate of the mean dew point temperature in degrees Fahrenheit
precip: The 30-year estimate of annual precipitation in inches

The second dataset contains the 50-year ground snow load at a variety of measurement stations across the state of Utah [28]. Predictors are obtained from gridded PRISM 30-year Climate Normals [27]:
yr50: The response variable; measures the design ground snow load at the site in kilopascals (kPa)
ELEVATION: The elevation of the measurement site in meters
PPTWT: The total amount of precipitation in a year in inches
MCMT: The mean temperature of the coldest month of the year in Celsius
MWMT: The mean temperature of the warmest month of the year in Celsius

To fix the skewed distribution of the yr50 variable, a log transform was taken of the response.
The third dataset contains variables related to mapping the presence of poverty in various states of Kenya [29, 30]. The variables in the dataset include the following:
POORDENS: The number of poor people per km²
AREA: The area of the active community group in Kenya
FLIVDEN: The density of livestock expressed in total livestock units/km²
ACCWAT: The percentage of area within one hour walking distance of a permanent natural water source
PTOWNKM: Distance from the shopping center in each sublocation to the nearest major town by road, in km
GROUPDENS: The total number of active community groups, including non-governmental organizations and community-based organizations
LATITUDE: The latitude of the centroid of the community group (obtained from the accompanying shapefile)
LONGITUDE: The longitude of the centroid of the community group (obtained from the accompanying shapefile)
We use cross-validation to assess the predictive accuracy of each model. In cross-validation, we divide the data into $k$ disjoint partitions known as "folds". The model is trained on $k - 1$ folds and used to predict the response $Y$ of the withheld fold. The predictions $\hat{Y}$ of the model can be compared to the real response $Y$ for an assessment of the performance of the model. We repeat the training on $k - 1$ folds $k$ times, withholding a different fold each time, such that all data in the training data eventually has the chance to be withheld and compared to a prediction from the model where the fold was absent. This strategy of withholding information at each step provides an estimate of a model's ability to predict new observations and discourages over-fitting the input data. Cross-validation is the gold standard for the assessment of a model when a separate testing dataset is not available.

One choice in forming the $k$ folds is to randomly select observations from the training data to be a part of each fold. However, the autocorrelation present in spatial datasets can cause traditional cross-validation to under-estimate the true model error. [16] discusses this issue and presents a solution for the cross-validation of spatial data known as "spatial blocking". In this setup, the folds in cross-validation are formed by creating chunks of neighboring observations, which limits the opportunity for geographically close neighbors to be a part of different folds in cross-validation.

If the spatial blocks that form the cross-validation folds are too large, then the predictive power of the model may be underestimated, as in a realistic setting predictions may often be made at sites that are very close to the training observations. Additionally, if we consider the regression tree models discussed in this paper to be a tool for decomposing a global spatial process into separate local processes, then withholding all data in a large spatial block region may inadvertently eradicate the local spatial region's representation. On the other hand, if the spatial blocks are too small, then the number of folds may be very large and dramatically increase the computation time required to perform the cross-validation procedure. In the absence of well-defined rules regarding the geographical size of the cross-validation folds, groups were constructed through trial and error with a consideration of the maximum geographical distance between within-group observations. In the case of both the Utah 2017 snow and soil datasets, a distance of 15 km was chosen. For the Utah 2017 snow dataset, this yields around one hundred sub-groups, which were then consolidated into 10 larger groups for use in cross-validation.

Once a vector of predictions from the model has been created for all 10 folds, the results of each algorithm on each dataset are evaluated with the root mean square error (RMSE). This is a common metric to assess the predictive accuracy of cross-validation for continuous regression problems:

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$

where $n$ is the number of observations in the dataset used for cross-validation, $y_i$ is the $i$th element of the true response vector of the training data, and $\hat{y}_i$ is the $i$th element of the prediction vector made by the model from 10-fold cross-validation.
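A condensed R sketch of this evaluation loop (our illustration; blocks holds a precomputed spatial-block fold label for each observation, and fit_model/predict_model stand in for any of the compared methods):

# Spatial-block cross-validation: train on all but one block of neighboring
# observations, predict the withheld block, and score with RMSE.
spatial_block_cv_rmse <- function(X, y, blocks, fit_model, predict_model) {
  preds <- rep(NA_real_, length(y))
  for (b in unique(blocks)) {
    held_out <- blocks == b
    model <- fit_model(X[!held_out, , drop = FALSE], y[!held_out])
    preds[held_out] <- predict_model(model, X[held_out, , drop = FALSE])
  }
  sqrt(mean((preds - y)^2))    # RMSE over all cross-validated predictions
}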
6.2.1 A tuning method for autocart

As the autocart function contains several tune-able parameters (namely $\alpha$, $\beta$, and the spatial bandwidth $b$), we need to be careful how we select the optimal choice of parameters during cross-validation. The "best" performing choices of $\alpha$, $\beta$, and $b$ over all the training data may be different than the best performing choices for a random subset of the data, such as a subset used in cross-validation. In a realistic scenario, we predict new data that was not a part of the training data, and thus we do not have the labeled response variable $y$ to tune the parameters with. Thus, instead of tuning the parameters $\alpha$, $\beta$, and $b$ over all the data and then performing cross-validation, we tune the parameters on the 9 training groups and then predict onto the withheld "test" group. In this way, the optimal choices for $\alpha$, $\beta$, and $b$ will likely vary each time we withhold a group. In the next section, the cross-validated accuracy of autocart is obtained using this tuning method; a nested tuning loop of this form is sketched below.
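A sketch of that nested tuning (reusing spatial_block_cv_rmse from the previous sketch; the parameter grid values are illustrative assumptions, not the values used in the paper's experiments):

# For each held-out group, tune (alpha, beta, b) by inner cross-validation
# over the remaining groups, then predict the held-out group once.
# 'fit' is assumed to accept a named set of parameters (alpha, beta, b).
tune_and_predict <- function(X, y, groups, fit, predict_fn,
                             grid = expand.grid(alpha = c(0, 0.25, 0.5),
                                                beta  = c(0, 0.25, 0.5),
                                                b     = c(20, 40, 80))) {
  preds <- rep(NA_real_, length(y))
  for (g in unique(groups)) {
    test <- groups == g
    scores <- apply(grid, 1, function(pars) {
      spatial_block_cv_rmse(X[!test, , drop = FALSE], y[!test],
                            groups[!test],
                            function(Xi, yi) fit(Xi, yi, pars),
                            predict_fn)
    })
    best <- grid[which.min(scores), ]     # tuned without touching 'test'
    model <- fit(X[!test, , drop = FALSE], y[!test], best)
    preds[test] <- predict_fn(model, X[test, , drop = FALSE])
  }
  preds
}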
To test the datasets, 6 different methods are used:

• "Simple" prediction: This is a baseline prediction method that ensures a machine-learning method is providing an improvement over an overly simplistic model. The simple prediction is formed by taking the average of the response variable in the nine training groups, and then using that average to predict for the withheld group in cross-validation.

• Regression trees: Simple regression trees as introduced at the beginning of the paper. The trees are pruned to the point with the best cross-validated accuracy.

• Autocart with $p = 2$: An autocart interpolation tree using the power parameter $p = 2$.

• Autocart with $p_1 = 0.5$, $p_2 = 3.0$: An autocart interpolation tree using the ranged power parameters $p_1$ and $p_2$, meant to provide a comparison to the unranged power parameter $p = 2$.

• Random forest with ntree = 100: A random forest made of 100 regression trees.

• Autoforest with ntree = 100: An autoforest made up of 100 autocart trees.

In each column representing a dataset, the RMSE of the "best performing" model is written in bold font.
RMSE of spatial cross-validation:

Method | September 2017 Soil (proportion of water composition per cm³) | Utah 2017 Snow (log of 50-year ground snow load avg in kPa) | Kenya Poverty Mapping (log of number of poor residents per km²)
"Simple" prediction | 0.0882 | 0.8890 | 1.219
Regression trees | 0.1082 | 0.3445 | 1.255
Autocart with p = 2 | 0.0962 | 0.3097 | 0.966
Autocart with p1 = 0.5, p2 = 3.0 | – | – | –
Random forest with ntree = 100 | – | – | –
Autoforest with ntree = 100 | – | – | –
In the "September 2017 Soil" dataset, we observe that the simple prediction using the average of the response in the nine training groups was nearly the best performing method, outperforming all except Random Forests by a slim margin. This highlights an inadequacy in the data, as none of the tested machine-learning methods are capable of learning the patterns in the labeled response variable given the set of gridded predictor variables. The following are possible explanations for the poor performance of the models on the data:

1. The variation in soil moisture may be much more "localized" than previously thought. As the soil moisture data is only available at 195 sites in the 2017 soil moisture dataset, there is a strong possibility that there does not exist enough data for the machine-learning models to appropriately characterize the patterns in the landscape.

2. The given gridded predictor variables have no relationship with soil moisture. One requirement for candidate predictor variables is that they are available as high-resolution gridded maps for the area of interest. The number of candidate variables satisfying this requirement is limited. Thus, there may be variables better suited for soil moisture prediction that are unusable in their current forms.

3. The data may be contaminated. Some of the sampling locations yield unusually high soil moisture measurements. This may be the result of an improper inclusion of irrigation site data (yielding an artificially high measurement of soil moisture) or perhaps an anomalous rainy day. Such anomalies defy relationships that may otherwise exist between soil moisture and the predictor variables, but there is no current way to know which observations should be removed due to human intervention in the soil moisture content.

Datasets covering other time periods were supplied by the Plants, Soils, and Climates department at USU. However, these datasets contained considerably fewer soil moisture remote sensing stations, around 95 as opposed to the 195 in the September 2017 datasets. The timeline required to obtain the additional soil moisture information fell outside the scope of this current project.

All other datasets with the 95 remote sensing sites exhibited the same problems as the September 2017 soil moisture dataset: overall poor performance from tree-based machine-learning methods using the PRISM gridded climate normals. This was the same regardless of the time period observed. Tested time periods included the average of all summer month soil moisture measurements since 2016, and the averages of soil moisture measurements in individual months, weeks, and days. Future research will require a re-evaluation of the candidate predictor variables for soil moisture as well as methods to detect and remove data anomalies associated with irrigated sites.
Using the cross-validated RMSE score is not the only way we can evaluate the performance of these methods. In Section 2 and Figure 1, it is mentioned that the ultimate aim of these methods is to characterize a complex landscape in a physically realistic way.

As an example, in the field of meteorology, advection schemes are a way to model the horizontal flow of heat or matter through a certain geographic region. Advection schemes are based upon gradients (i.e., local rates of change) which must be relatively smooth to avoid numerical precision issues in modeling. Having an extreme jump in a modeled surface can lead to an extreme local gradient, thereby disrupting the small-scale meteorology. While smoothness is not a necessary condition of predictions on all spatial datasets, most spatial and ecological datasets benefit from predictions with realistic transitions in modeled values across space.

In a spatial or ecological setting, it may be the case that we reject a method that has a slightly higher measure of predictive accuracy in favor of a method with similar accuracy and physically realistic output. In this sense, the method that appears to be providing the map with the most detail and physical realism across the geographical landscape may be judged to be the "superior" method, as there is strong evidence it is characterizing the patterns inherently present in the data. We are unaware of any methods that formally balance accuracy with smoothness. In practice, this balance will be domain specific.

A critical parameter in the formation of a regression tree is the choice of pruning: the depth that the tree is grown to. A pruned regression tree may have few terminal nodes, thus yielding a "chunky" landscape with sharp and discrete breaks across the space. It would be unfair to say that a pruned regression tree does a poor job of characterizing a complex landscape. To circumvent this unfair comparison, in the mapping process both the simple regression tree and the autocart tree are left unpruned. The stopping criteria for the autocart and regression tree growth are the same, so we are ensured that there will be a similar number of terminal nodes in each tree.

7.2.1 2017 Utah Soil Moisture
Although the RMSE of all machine-learning methods tested is less than impressive when compared against the "simple" prediction discussed in Section 6.2.2, the maps generated by the methods are promising.

Figure 3: Prediction maps of average soil moisture (proportion of water per cm³) in September 2017. (a) Regression trees. (b) Autocart with ranged power parameters $p_1 = 0.5$, $p_2 = 3.0$.

In Figure 3b, we observe the characterization of a more complex landscape, as compared to Figure 3a. One noticeable "halo" of the interpolative effect can be seen at approximately (112° W, 41° N).

Figure 4: Prediction maps of average soil moisture (proportion of water per cm³) in September 2017 with ensemble methods. (a) Random forests. (b) Autoforest with ntree = 100 and $p_1 = 0.5$, $p_2 = 3.0$.

In both the Random Forest and Autoforest maps, we see an improved characterization of the complex landscape with physically realistic predictions as compared to the traditional regression-tree-based approaches. Although there is a discernible difference between the two maps, the difference is not as stark as with Figures 3a and 3b.

It is important to mention that Figure 4a differs from Figure 1 from the beginning of the paper in that longitude/latitude are not included as predictors in the model. As the coordinate information is not explicitly included as a predictor, we do not observe the same visual artifacts that we did in Figure 1.

7.2.2 2017 Utah ground snow load

Figure 5: Prediction maps of the 50-year ground snow load (log of kPa) in Utah as of 2017. (a) Regression trees. (b) Autocart with ranged power parameters $p_1 = 0.5$, $p_2 = 3.0$.

In a similar way to Figure 3, we observe that Figure 5b is slightly more smooth and complex in nature. The smoothness of the autocart tree can once again be primarily attributed to the interpolation step at the terminal nodes of the tree. The efficacy of both the regression tree and the autocart tree is observed here, as the patterns in the map directly reflect the mountains of Utah. It is no surprise that high snow loads are observed in mountainous locations.

Figure 6: Prediction maps of the 50-year ground snow load (log of kPa) in Utah as of 2017 with ensemble methods. (a) Random forests. (b) Autoforest with ntree = 100 and $p_1 = 0.5$, $p_2 = 3.0$.

In the prediction maps of Random Forests and Autoforest, we see very few differences. Primarily, we observe a slightly more "patchy" area in southwestern Utah in Figure 6a when compared with the smooth southwestern Utah area in Figure 6b.
All software and source code surrounding the ideas in this paper are available at https://github.com/ethanancell/autocart. All methods are implemented in the R statistical software environment [31]. The autocart package utilizes the Rcpp [32, 33, 34], RcppArmadillo [35], RcppParallel [36], fields [37], and mgcv [38, 39, 40, 41, 42] R packages. Regression tree and random forest testing were performed using the rpart [43] and randomForest [44] packages respectively.

Several directions for future research remain:

• Resolving possible issues in the soil moisture dataset
As mentioned in Section 7, a future research direction would be obtaining gridded predictor variables that have a stronger relationship with the measure of soil moisture. As it currently stands, the PRISM 30-year normals and other derived predictors do not appear to strongly characterize soil moisture. In addition, it would be helpful to have a human examination of the data so that contaminated sites (such as measurements taken on irrigated land) can be removed. The inclusion of these sites can obscure a possible underlying relationship between the predictor variables and soil moisture.
• Another ensemble method of autocart trees
In addition to a "forest" of autocart trees, another interesting research direction would be creating a boosted ensemble of autocart trees. Boosted trees are another tree-ensemble method [45] and are very popular and powerful tools for both regression and classification. It may be the case that a boosted ensemble of autocart trees does comparatively better than a random forest ensemble of autocart trees.

• An automatic selection of the power parameter
Wherever the ranged or unranged power parameter exists, some speculation and knowledge on the part of the researcher is required to pick a good value of $p$, or of $p_1$ and $p_2$. It may be possible to include an algorithm that can automatically pick up on the strength of the spatial relationship in a particular region.

• An objective function representing the interaction between $g_{ac}$ and $g_{sc}$

It may be the case that the objective function $g_{sc}$ described in Section 4.2 is only useful in regions with strong autocorrelation. There may be some predictive value in selecting splits that maximize an objective function such as $\tilde{g} = g_{ac} \cdot g_{sc} \cdot I\{g_{ac} > \lambda\}$, where $I$ is the indicator function and $\lambda$ is some threshold of autocorrelation that needs to be met before the interaction objective function $g_{ac} g_{sc}$ is used in evaluating the split.

• Using $g_{ac}$ and $g_{sc}$ in a non-tree-based machine-learning method

In many cases, other machine-learning methods are easily adaptable with a change in the objective function. The objective functions $g_{ac}$ and $g_{sc}$ described in this paper could be used as objective functions for other methods such as neural networks or support vector machines.

• A formal evaluation of the smoothness of a region
In Section 7.2, the smoothness of the maps is compared with a visual assessment. The visual comparison of these maps could be supplemented with a numerical test of "smoothness" over a region; one possible metric is sketched below.
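As one possibility (our suggestion, not an experiment from this paper), a simple roughness score averages squared differences between neighboring cells of a gridded prediction surface; smoother maps score lower:

# A crude smoothness (roughness) score for a gridded prediction surface:
# the mean squared difference between horizontally and vertically adjacent
# cells. Lower values indicate smoother maps.
roughness <- function(grid) {   # grid: matrix of predictions on a raster
  dx <- grid[, -1] - grid[, -ncol(grid)]   # horizontal neighbor differences
  dy <- grid[-1, ] - grid[-nrow(grid), ]   # vertical neighbor differences
  mean(c(dx^2, dy^2), na.rm = TRUE)
}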
10 Conclusion
Complex spatial datasets pose a difficult challenge to existing machine-learning methods, as they are not equipped to handle spatial information in a natural way. Autocart regression trees provide a way to imbue a traditional regression tree with coordinate information through the use of spatially-aware objective functions. The autocart tree is also capable of traditional interpolation in the terminal nodes of the tree using an adaptive inverse distance weighting technique.

Spatially-aware regression trees such as autocart also show a level of promise in providing results with a high measure of predictive accuracy on spatial datasets. In addition, the mapping results of autocart trees exhibit many desirable properties such as smoothness in the modeled response variable over the geographical region. In many cases, the mapped result of an autocart tree is very similar to that obtained by a random forest, yet retains a simpler model form.

Although the autocart method was originally created to respond to the unique challenges of modeling soil moisture in a climatically-diverse state like Utah, as of now this method does not show a particularly strong increase in predictive accuracy over a very simple regression formed by averaging the available data. It is suspected that this may be the product of a lack of soil moisture data availability, contamination of irrigation data, or an unfortunate lack of predictive power in available gridded climate variable predictors.

Due to the simple nature of the autocart tree's structure, it can be easily included in an ensemble method such as a random forest of autocart trees. Although the "autoforest" gain in accuracy over its traditional counterpart is not as pronounced in the random forest as it was with regression trees, it is evident that autocart trees show potential in being adapted into ensemble methods. The autocart package provides the framework and template for continued improvements in the ways that the spatial data of today is handled in the machine-learning algorithms of tomorrow.

References
[1] I. Rodriguez-Iturbe, "Ecohydrology: A hydrologic perspective of climate-soil-vegetation dynamics," Water Resources Research, vol. 36, no. 1, pp. 3–9, 2000.

[3] Water Resources Research, vol. 55, no. 1, pp. 729–753, 2019.

[4] P. Goovaerts, Geostatistics for Natural Resources Evaluation. Oxford University Press, 1997.

[5] A. B. McBratney, M. M. Santos, and B. Minasny, "On digital soil mapping," Geoderma, vol. 117, no. 1-2, pp. 3–52, 2003.

[6] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer Series in Statistics, New York, 2001.

[7] E. W. Fox, J. M. Ver Hoef, and A. R. Olsen, "Comparing spatial regression to random forests for large environmental data sets," PLoS ONE, vol. 15, no. 3, p. e0229509, 2020.

[8] R. Grimm, T. Behrens, M. Märker, and H. Elsenbeer, "Soil organic carbon concentrations and stocks on Barro Colorado Island—digital soil mapping using random forests analysis," Geoderma, vol. 146, no. 1-2, pp. 102–113, 2008.

[9] J. Li, A. D. Heap, A. Potter, Z. Huang, and J. J. Daniell, "Can we improve the spatial predictions of seabed sediments? A case study of spatial interpolation of mud content across the southwest Australian margin," Continental Shelf Research, vol. 31, no. 13, pp. 1365–1376, 2011.

[10] M. Sommer, M. Wehrhan, M. Zipprich, U. Weller, W. Zu Castell, S. Ehrich, B. Tandler, and T. Selige, "Hierarchical data fusion for mapping soil units at field scale," Geoderma, vol. 112, no. 3-4, pp. 179–196, 2003.

[11] D. R. Cutler, T. C. Edwards Jr., K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler, "Random forests for classification in ecology," Ecology, vol. 88, no. 11, pp. 2783–2792, 2007.

[12] G. De'ath and K. E. Fabricius, "Classification and regression trees: a powerful yet simple technique for ecological data analysis," Ecology, vol. 81, no. 11, pp. 3178–3192, 2000.

[13] T. Hengl, J. Mendes de Jesus, G. B. Heuvelink, M. Ruiperez Gonzalez, M. Kilibarda, A. Blagotić, W. Shangguan, M. N. Wright, X. Geng, B. Bauer-Marschallinger, et al., "SoilGrids250m: Global gridded soil information based on machine learning," PLoS ONE, vol. 12, no. 2, p. e0169748, 2017.

[14] A. Ramcharan, T. Hengl, T. Nauman, C. Brungard, S. Waltman, S. Wills, and J. Thompson, "Soil property and class maps of the conterminous United States at 100-meter spatial resolution," Soil Science Society of America Journal, vol. 82, no. 1, pp. 186–201, 2018.

[15] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. CRC Press, 1984.

[16] H. Meyer, C. Reudenbach, S. Wöllauer, and T. Nauss, "Importance of spatial predictor variable selection in machine learning applications: moving from data reproduction to spatial prediction," Ecological Modelling, vol. 411, p. 108815, 2019.

[17] R. A. Dubin, "Spatial autocorrelation: a primer," Journal of Housing Economics, vol. 7, no. 4, pp. 304–327, 1998.

[18] W. R. Tobler, "A computer movie simulating urban growth in the Detroit region," Economic Geography, vol. 46, no. sup1, pp. 234–240, 1970.

[19] P. Legendre, "Spatial autocorrelation: Trouble or new paradigm?," Ecology, vol. 74, no. 6, pp. 1659–1673, 1993.

[20] H.-M. Kim, B. K. Mallick, and C. Holmes, "Analyzing nonstationary spatial data using piecewise Gaussian processes," Journal of the American Statistical Association, vol. 100, no. 470, pp. 653–668, 2005.

[21] M. J. Heaton, W. F. Christensen, and M. A. Terres, "Nonstationary Gaussian process models using spatial hierarchical clustering from finite differences," Technometrics, vol. 59, no. 1, pp. 93–101, 2017.

[22] D. Stojanova, M. Ceci, A. Appice, D. Malerba, and S. Džeroski, "Dealing with spatial autocorrelation when learning predictive clustering trees," Ecological Informatics, vol. 13, pp. 22–39, 2013.

[23] A. Appice and D. Malerba, "Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering," Data Mining and Knowledge Discovery, vol. 28, no. 5-6, pp. 1266–1313, 2014.

[24] S. Georganos, T. Grippa, A. Niang Gadiaga, C. Linard, M. Lennert, S. Vanhuysse, N. Mboga, E. Wolff, and S. Kalogirou, "Geographical random forests: a spatial extension of the random forest algorithm to address spatial heterogeneity in remote sensing and population modelling," Geocarto International, pp. 1–16, 2019.

[25] P. A. Moran, "Notes on continuous stochastic phenomena," Biometrika, vol. 37, no. 1/2, pp. 17–23, 1950.

[26] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[27] PRISM Climate Group, Oregon State University. http://prism.oregonstate.edu. Created 4 Feb 2004.

[28] B. Bean, M. Maguire, and Y. Sun, "The Utah snow load study," Tech. Rep. 4591, Utah State University, Department of Civil and Environmental Engineering, 2018.

[29] Center for International Earth Science Information Network (CIESIN), Columbia University, "Poverty mapping project: Poverty and food security case studies," 2005.

[30] International Livestock Research Institute, "Kenya Kajiado Case Study: ILRI, Nairobi," 2004.

[31] R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.

[32] D. Eddelbuettel and R. François, "Rcpp: Seamless R and C++ integration," Journal of Statistical Software, vol. 40, no. 8, pp. 1–18, 2011.

[33] D. Eddelbuettel, Seamless R and C++ Integration with Rcpp. New York: Springer, 2013. ISBN 978-1-4614-6867-7.

[34] D. Eddelbuettel and J. J. Balamuta, "Extending R with C++: A brief introduction to Rcpp," PeerJ Preprints, vol. 5, p. e3188v1, Aug. 2017.

[35] D. Eddelbuettel and C. Sanderson, "RcppArmadillo: Accelerating R with high-performance C++ linear algebra," Computational Statistics and Data Analysis, vol. 71, pp. 1054–1063, March 2014.

[36] J. J. Allaire et al., "RcppParallel." https://cran.r-project.org/web/packages/RcppParallel/index.html, 2020.

[37] D. Nychka, R. Furrer, J. Paige, and S. Sain, "fields: Tools for spatial data," 2017. R package version 11.4.

[38] S. N. Wood, "Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models," Journal of the Royal Statistical Society (B), vol. 73, no. 1, pp. 3–36, 2011.

[39] S. N. Wood, N. Pya, and B. Säfken, "Smoothing parameter and model selection for general smooth models (with discussion)," Journal of the American Statistical Association, vol. 111, pp. 1548–1575, 2016.

[40] S. N. Wood, "Stable and efficient multiple smoothing parameter estimation for generalized additive models," Journal of the American Statistical Association, vol. 99, no. 467, pp. 673–686, 2004.

[41] S. Wood, Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2nd ed., 2017.

[42] S. N. Wood, "Thin-plate regression splines," Journal of the Royal Statistical Society (B), vol. 65, no. 1, pp. 95–114, 2003.

[43] T. Therneau, B. Atkinson, and B. Ripley, "rpart." https://CRAN.R-project.org/package=rpart, 2019.

[44] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

[45] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.