Ordinal Trees and Random Forests: Score-Free Recursive Partitioning and Improved Ensembles
Gerhard Tutz
Ludwig-Maximilians-Universität München, Akademiestraße 1, 80799 München
February 2, 2021
Abstract
Existing ordinal trees and random forests typically use scores that are assigned to the ordered categories, which implies that a higher scale level is used. Versions of ordinal trees are proposed that take the scale level seriously and avoid the assignment of artificial scores. The basic construction principle is based on an investigation of the binary models that are implicitly used in parametric ordinal regression. These building blocks can be fitted by trees and combined in a similar way as in parametric models. The obtained trees use the ordinal scale level only. Since binary trees and random forests are constituent elements of the trees one can exploit the wide range of binary trees that have already been developed. A further topic is the potentially poor performance of random forests, which seems to have been ignored in the literature. Ensembles that include parametric models are proposed to obtain prediction methods that tend to perform well in a wide range of settings. The performance of the methods is evaluated empirically by using several data sets.
Keywords: recursive partitioning; trees; random forests; ensemble methods; ordinal regression
There is a long tradition of analyzing ordinal response data by using parametric models, which started with the seminal paper of McCullagh (1980). More recently, recursive partitioning methods have been developed that allow one to investigate the impact of explanatory variables by non-parametric tools. Single trees and random forests for ordinal responses have several advantages: they can be applied to large data sets and are considered to perform very well in prediction. A problem with most of the ordinal trees is that they assume that scores are assigned to the ordered categories of the response. The assignment of scores can be warranted in some cases, in particular if ordinal responses are built from continuous variables by grouping. However, it is rather artificial and arbitrary in genuine ordinal response data, for example, if the response represents ordered levels of severeness of a disease. Then one cannot choose the midpoints of the intervals from which the ordered response is built as suggested by Hothorn et al. (2006) since no continuous variable is observed. If nevertheless scores are assigned they can affect the prediction results, although that does not always have to be the case, see also Janitza et al. (2016). The packages rpartOrdinal (Archer, 2010) as well as the improved version rpartScore (Galimberti et al., 2012), which are based on the Gini impurity function, use assigned scores. The same holds for the random forests proposed by Janitza et al. (2016) and the ordinal version of conditional trees of the package party (Hothorn et al., 2006; Hothorn and Zeileis, 2015). The random forest approach proposed by Hornung (2020) is somewhat different; it also translates ordinal measurements into continuous scores but optimizes scores instead of using a fixed score. Versions of random forests without scores were proposed more recently by Buri and Hothorn (2020).
They use the ordinal proportional odds model to obtain statistics that are used in splitting.

In the following, alternative trees and random forests that take the scale level of the response seriously are proposed. The main concept is that ordinal responses contain binary responses as building blocks. This has already been implicitly used in parametric modeling approaches. For example, the widely used proportional odds model can be seen as a model that parameterizes the split of response categories into two groups of adjacent categories. But the principle also holds for alternative models such as the adjacent categories model and the sequential model; see Tutz (2021) for an overview and a taxonomy of ordinal regression models. The proposed trees explicitly use the representation of ordinal responses as a set of binary variables. Random forests for the binary variables are used to obtain random forests for ordinal response data.

For random forests it is important that they provide good performance in terms of prediction. They are commonly considered as being very efficient. However, as will be demonstrated, this does not hold in general. In many cases simple parametric models turn out to be at least as efficient and sometimes more efficient than the carefully designed random forests. Typically, when ordinal forests are propagated the accuracy is investigated for versions of random forests only, but they are not compared to parametric competitors. In the following we propagate the use of ensembles that include parametric models to provide a stable prediction tool that works well in all kinds of data sets.

The paper has two objectives: introducing score-free recursive partitioning and random forests, and propagating ensembles that include parametric models. In Section 2 the representation of ordinal responses as a sequence of binary responses is briefly considered. It makes clear that specific binary responses can be seen as building blocks of classical parametric models.
In Section 3 it is shown how these building blocks can be used to construct score-free trees and random forests. In addition, more general ensembles are considered. In Section 4 the performance of the ensembles is investigated by using real data sets. Section 5 is devoted to importance measures, which are an essential ingredient of random forests since the impact of variables on prediction in random forests is not directly available.

In the following the representation of an ordinal response as a collection of binary responses is considered. It can be seen as being behind the construction of parametric ordinal models and will serve to construct a novel type of recursive partitioning that does not use assigned scores.

Let the ordinal response Y take values from {1, . . . , k}. Although these values suggest a univariate response, the actual response is multivariate since the numbers 1, . . . , k just represent that outcomes are ordered; distances between numbers assigned to categories should not be built in an ordinal scale because they are not interpretable.

A multivariate representation of the outcome can be obtained by using binary dummy variables. Natural candidates for dummy variables are the split variables

Y_r = 1 if Y ≥ r,  Y_r = 0 if Y < r.    (1)

Then Y = r is represented by a sequence of ones followed by a sequence of zeros, (Y_2, . . . , Y_k) = (1, . . . , 1, 0, . . . , 0). The vector (Y_2, . . . , Y_k) can be seen as a multivariate representation of the response. The dummy variables that generate vectors of this form, which are characterized by a sequence of ones followed by a sequence of zeros, have also been referred to as Guttman variables (Andrich, 2013).

Classical ordinal regression models use these dummy variables but are most often derived from the assumption of an underlying continuous variable, and the link to split variables is ignored.
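As a minimal illustration (not code from the paper; the function name is ours), the split-variable encoding in (1) can be sketched as follows:

```python
def split_variables(y, k):
    """Encode an ordinal response y in {1, ..., k} as the split
    variables (Y_2, ..., Y_k) of (1), where Y_r = 1 iff y >= r.
    The result is a Guttman vector: ones followed by zeros."""
    return [1 if y >= r else 0 for r in range(2, k + 1)]
```

For example, with k = 5 the response y = 3 is encoded as (1, 1, 0, 0).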
The most widely used proportional odds model, also called cumulative logistic model, has the form

P(Y ≥ r | x) = F(β_r + x^T β),  r = 2, . . . , k,    (2)

where x is a vector of explanatory variables and F(η) = exp(η)/(1 + exp(η)) is the logistic distribution function. For the parameters one has the restriction β_2 ≥ · · · ≥ β_k. The model explicitly uses the dichotomizations given by (1). Since Y ≥ r iff Y_r = 1 the model can also be given as

P(Y_r = 1 | x) = F(β_r + x^T β),  r = 2, . . . , k.    (3)

Thus, the proportional odds model is equivalent to a collection of binary logit models that have to hold simultaneously. The model implies that the effect of covariates contained in x^T β is the same for all dichotomizations. That means if one fits the binary models (3) separately one should obtain similar values for estimates of β. This restriction can be weakened by using the partial proportional odds model, in which the effect of variables may depend on the category, that is, the linear term x^T β in (2) is replaced by x^T β_r.

Model (2) is a so-called cumulative model since on the left hand side one has the sum of probabilities P(Y ≥ r | x). Cumulative models form a whole family of models, whose members are characterized by the choice of a specific strictly increasing distribution function F(·). They have been investigated and extended, among others, by McCullagh (1980), Brant (1990), Peterson and Harrell (1990), Bender and Grouven (1998), Cox (1995), Kim (2003) and Liu et al. (2009).

An alternative ordinal regression model is the adjacent categories model, which has the basic form

P(Y ≥ r | Y ∈ {r − 1, r}, x) = F(β_r + x^T β),  r = 2, . . . , k.    (4)

Since P(Y ≥ r | Y ∈ {r − 1, r}, x) = P(Y = r | Y ∈ {r − 1, r}, x) it specifies the probability of observing category r given the response is in categories {r − 1, r}. Because of the conditioning it can be seen as a local model. The interesting point is that it also uses the split variables.
It is easily seen that it is equivalent to

P(Y_r = 1 | Y_{r−1} = 1, Y_{r+1} = 0, x) = F(β_r + x^T β),  r = 2, . . . , k.    (5)

Thus, it specifies the binary response variable Y_r conditionally, in contrast to cumulative models, which determine the binary response directly in an unconditional way. But as for cumulative models it is assumed that the binary models (5) hold simultaneously.

The adjacent categories logit model may also be considered as the regression model that is obtained from the row-column (RC) association model considered by Goodman (1981a,b), Kateri (2014). It is also related to Anderson's stereotype model (Anderson, 1984), which was considered by Greenland (1994) and Fernandez et al. (2019). It has been most widely used as a latent trait model in the form of the partial credit model (Masters, 1982; Masters and Wright, 1984; Muraki, 1997).

An advantage of the adjacent categories model is that one can replace the parameter vector β by a category-specific parameter vector β_r without running into problems. In cumulative models one has the restriction P(Y ≥ 2 | x) ≥ . . . ≥ P(Y ≥ k | x), which can yield problems, in particular when fitting the binary models (3) with category-specific parameter vectors β_r. For overviews of parametric ordinal models see, for example, Agresti (2010), Tutz (2012). They also include a third type of ordinal model, the sequential model, which is a specific process model, which could also be extended to tree type models. But because of its specific nature we do not consider it explicitly.

The main point is that binary models are at the core of parametric ordinal models. There is a good reason for that because the splits represent the order in categories without assuming more than an order of categories. In the following this is exploited to construct trees that account for the ordering of categories.
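To make the binary building blocks concrete, the following sketch (our function names, not code from the paper) computes category probabilities of a cumulative logit model from the binary representations (3), with thresholds assumed to satisfy the ordering restriction:

```python
import math

def logistic(z):
    """F(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(thresholds, eta):
    """Category probabilities of a cumulative logit model.
    thresholds = [beta_2, ..., beta_k], assumed non-increasing;
    eta = x' beta.  P(Y >= r | x) = F(beta_r + eta) for r = 2..k,
    and P(Y = r | x) is the difference of adjacent cumulative
    probabilities, with P(Y >= 1) = 1 and P(Y >= k+1) = 0."""
    cum = [1.0] + [logistic(b + eta) for b in thresholds] + [0.0]
    return [cum[r] - cum[r + 1] for r in range(len(cum) - 1)]
```

Because the thresholds are ordered, the cumulative probabilities decrease in r and the differences form a proper distribution over the k categories.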
The crucial role of split variables in modeling ordered responses can be used to obtain non-parametric tree models that use the ordering efficiently. There are basically two ways to do so: one is to use the split variables directly, which corresponds to cumulative type models; the other approach is to use them conditionally, which corresponds to the adjacent categories approach.
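The two construction principles amount to two ways of deriving binary training sets from the ordinal response. A minimal sketch (our function names; the real fitting is done by binary tree software):

```python
def split_based_targets(ys, r):
    """Unconditional split at r: every observation is kept,
    with binary target I(Y >= r) (cumulative-type approach)."""
    return [(i, 1 if y >= r else 0) for i, y in enumerate(ys)]

def adjacent_targets(ys, r):
    """Conditional split at r: only observations with Y in {r-1, r}
    are kept, with binary target I(Y = r) (adjacent categories
    approach)."""
    return [(i, 1 if y == r else 0) for i, y in enumerate(ys)
            if y in (r - 1, r)]
```

With ys = [1, 2, 3, 2], split_based_targets(ys, 2) uses all four observations, while adjacent_targets(ys, 3) keeps only the observations with Y in {2, 3}.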
Split variables are binary and therefore binary trees can be fitted. Let the tree for Y_r be given by

log [ P(Y_r = 1 | x) / P(Y_r = 0 | x) ] = tr_r(x),  r = 2, . . . , k,    (6)

where tr_r(x) denotes the partitioning of the predictor space, that is, the tree. Then one obtains for the probabilities

P(Y_r = 1 | x) = P(Y ≥ r | x) = exp(tr_r(x)) / (1 + exp(tr_r(x))).

The corresponding trees are called split-based trees. Split variables are a formal tool to group categories but have substantial meaning in many applications. For example, in the retinopathy data set (Bender and Grouven, 1998), which will also be considered later, the response categories are 1: no retinopathy, 2: nonproliferative retinopathy, 3: advanced retinopathy or blind. Thus the split between categories {1} and {2, 3} distinguishes between healthy and not healthy, whereas the split between {1, 2} and {3} distinguishes between serious illness and otherwise. It is crucial that explanatory variables may play different roles for different splits. In the retinopathy data set, with explanatory variables smoking (SM = 1: smoker, SM = 0: non-smoker), diabetes duration (DIAB) measured in years, glycosylated hemoglobin (GH), measured in percent, and diastolic blood pressure (BP) measured in mmHg, one obtains for the two splits the trees shown in Figure 1 (fitted by using ctree, Hothorn et al. (2006)). It is seen that the trees are quite different, which means that explanatory variables play differing roles when used to distinguish between healthy and not healthy and between serious illness and less serious illness.

Figure 1:
Conditional trees for retinopathy data, upper panel: split between {1} and {2, 3}, lower panel: split between {1, 2} and {3}.

Instead of the unconditional split variables considered previously let us consider the conditional binary variables

Ỹ_r = 1 if Y ≥ r,  Ỹ_r = 0 if Y < r,  given Y ∈ {r − 1, r},    (7)

r = 2, . . . , k. The variables are conditional versions of split variables. More concretely, Ỹ_r represents Y_r | Y_{r−1} = 1, Y_{r+1} = 0. The main difference between Ỹ_r and Y_r is that the former is a conditional variable. This is important since fitting a tree to Ỹ_r means one includes only observations with Y ∈ {r − 1, r}. The corresponding tree can be seen as a nonparametric version of the adjacent categories model and is called an adjacent categories tree. The corresponding trees are local; they reflect the impact of explanatory variables on the distinction between adjacent categories.

Adjacent categories trees have a different interpretation than trees for split variables. For illustration, Figure 2 shows the fitted trees for the retinopathy data. It is seen that diabetes duration (DIAB) has an impact in both trees. In the split between categories 1 and 2 the only other variable that is significant is glycosylated hemoglobin, while in the split between categories 2 and 3 it is blood pressure. The trees are smaller than split-based trees since due to conditioning the number of observations is smaller. From a substantial point of view it might be most interesting to combine trees from the different splitting concepts. The first tree in Figure 1 distinguishes between {1} and {2, 3}, that is, between healthy and non healthy. The second tree in Figure 2 shows which variables are significant when distinguishing between categories 2 and 3 given the response is in categories {2, 3}, that is, which variables are influential given the patient suffers from retinopathy.

Figure 2:
Conditional trees for retinopathy data, left: split given Y ∈ {1, 2}, right: split given Y ∈ {2, 3}.

Single trees can be informative for researchers who want to investigate which variables have an impact on specific dichotomizations. If one has prediction in mind a better choice are random forests, which are much more stable and efficient than single trees (Breiman, 1996, 2001; Bühlmann et al., 2002). Then it is necessary to combine the results of single trees in a proper way.

Let us first consider split-based trees. They face the problem, familiar from cumulative models with category-specific effects, that specific constraints have to be fulfilled. More specifically, for all values of x the constraint P(Y ≥ 2 | x) ≥ · · · ≥ P(Y ≥ k | x) has to hold, which is equivalent to P(Y_2 = 1 | x) ≥ · · · ≥ P(Y_k = 1 | x). However, for separately fitted trees the corresponding condition tr_2(x) ≥ · · · ≥ tr_k(x) does not necessarily hold. The same problem occurs in the partial proportional odds model, for which β_2 + x^T β_2 ≥ · · · ≥ β_k + x^T β_k has to hold.

Let π̂(x)_(r) = P̂(Y ≥ r | x) denote the estimated cumulative probabilities resulting from the tree for the split variable Y_r. Then probabilities are obtained by P̂(Y = r | x) = π̂(x)_(r) − π̂(x)_(r+1) if π̂(x)_(r) ≥ π̂(x)_(r+1) for all r. If the latter condition does not hold, the cumulative probabilities π̂(x)_(2), . . . , π̂(x)_(k) are fitted to be decreasing by using monotone regression tools. Alternative approaches to obtain compatible estimators have been considered in the machine learning community, for example, by Chu and Keerthi (2007).

An advantage of adjacent categories trees is that no monotonization tools are needed since estimated probabilities are always compatible. Let the adjacent categories trees be given by

log [ P(Ỹ_r = 1 | x) / P(Ỹ_r = 0 | x) ] = tr̃_r(x),  r = 2, . . . , k,    (8)

where tr̃_r(x) denotes the partitioning of the predictor space.
It is not hard to derive that the probability of a response in category r, given the representation (8) holds, has the form

P(Y = r | x) = exp( Σ_{s=2}^{r} tr̃_s(x) ) / Σ_{s=1}^{k} exp( Σ_{l=2}^{s} tr̃_l(x) ),    (9)

where the empty sum Σ_{l=2}^{1} tr̃_l(x) = 0. The representation (9) holds for any values of tr̃_2(x), . . . , tr̃_k(x); no specific restriction has to be fulfilled.

Random forests are obtained by combining not only the trees for split variables but averaging over a multitude of trees generated by randomization. The approach proposed here exploits the role of the split variables as building blocks for ordinal responses, and can be seen as a split variables based approach, which is unconditional in split-based trees and conditional in adjacent categories trees.

Before investigating the proposed random forests in detail let us point out a problem with ordinal trees that is often ignored. Most presentations of ordinal trees focus on the development of novel trees but do not compare the performance of random forests to the performance of simple parametric models such as the proportional odds model. That leaves the impression that random forests are the most efficient tools. As will be demonstrated in the following sections, parametric models should not be ignored; in many applications they can perform as well as random forests or even better. The use of parametric ordinal models for the prediction of ordinal responses has some tradition, see, for example, Rudolfer et al. (1995), Campbell and Donner (1989), Campbell et al. (1991), Anderson and Phillips (1981).

Random forests themselves are ensemble methods that combine various trees to obtain a good approximation of the underlying response probabilities. To exploit the potential strength of parametric models we propose an ensemble that includes these models.
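A direct transcription of (9) can be sketched as follows, assuming fitted adjacent categories scores tr̃_2(x), . . . , tr̃_k(x) are available (our function name, not code from the paper):

```python
import math

def adjacent_category_probs(scores):
    """Category probabilities (9) from adjacent-categories tree
    scores [tr_2(x), ..., tr_k(x)]; the empty sum for r = 1 is 0,
    so no ordering restriction on the scores is needed."""
    partial = [0.0]                      # sum_{s=2}^{r} tr_s(x)
    for t in scores:
        partial.append(partial[-1] + t)
    weights = [math.exp(p) for p in partial]
    total = sum(weights)
    return [w / total for w in weights]
```

Any real-valued scores yield a proper distribution over the k categories, which is exactly the advantage over split-based trees, where monotonization may be required.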
When estimating response probabilities we will use the ensemble

P̂(Y = r | x) = Σ_{j=1}^{M} w_j P̂_j(Y = r | x),

where P̂_j(Y = r | x) are estimated probabilities for the j-th learner. Learners can be random forests but also parametric models. The weights w_j are chosen according to the prediction performance of the j-th learner. The ensemble efficiently uses different types of learners. By combining them it yields more stable predictions than single learners and automatically gives more weight to the best learner in the ensemble.

Typically, in classification, predictions of single trees from an ensemble are combined by voting. Each subject with given values of the predictor is dropped through every tree such that each single tree returns a predicted class. The prediction of the ensemble is the class most trees voted for. One obtains a majority vote, which has also been called a committee method. It should be noted that the ensembles proposed here combine probabilities. They are not ensembles that use majority votes to combine class predictions obtained for each single learner. By computing the predicted class probabilities one can use more general accuracy measures that also take into account the precision of the prediction. Moreover, with ordinal responses it is sensible not to use the mode of the class probabilities but the median computed over the predicted probabilities as predicted class. The use of estimated probabilities is of crucial importance. We also considered majority votes that combine the votes on splits but the results were distinctly inferior to using probabilities.

One way to investigate the power of a model is to investigate its ability to predict future observations. In discriminant analysis one often uses class prediction as a measure of performance. Class prediction in the considered framework comes in two forms.
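The weighted combination of probabilities and the median-based class prediction described above can be sketched as follows (our function names; the weights are assumed given, e.g. derived from validation performance):

```python
def ensemble_probs(prob_lists, weights):
    """Weighted average of the learners' class-probability vectors,
    with weights normalized to sum to one."""
    total = float(sum(weights))
    k = len(prob_lists[0])
    return [sum(w * p[r] for w, p in zip(weights, prob_lists)) / total
            for r in range(k)]

def median_class(probs):
    """Median of the predictive distribution: the smallest category
    whose cumulative probability reaches 0.5; unlike the mode it
    respects the ordering of the categories."""
    cum = 0.0
    for r, p in enumerate(probs, start=1):
        cum += p
        if cum >= 0.5:
            return r
    return len(probs)
```

Note that the combination happens on the probability scale, not by majority voting over class predictions.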
As predicted class one may use the mode of the response, Ŷ = mod(x), which is in accordance with the Bayes prediction rule, or the median Ŷ = med(x), which makes use of the ordering of categories. Then for a new observation (Y, x) one typically considers the 0-1 loss function

L(Y, Ŷ) = I(Y ≠ Ŷ),

where I(·) is the indicator function. One obtains 1 if the prediction is wrong, and 0 if the prediction is correct. The average over new observations yields the 0-1 error rate.

Rather than giving just one value as a predictor for the class it is more appropriate to consider the whole vector π̂^T(x) = (π̂_1(x), . . . , π̂_k(x)), where π̂_r(x) = P̂(Y = r | x) is the probability one obtains after fitting a tree. The vector π̂(x) represents the predictive distribution. As Gneiting and Raftery (2007) postulated, a desirable predictive distribution should be as sharp as possible and well calibrated. Sharpness refers to the concentration of the distribution and calibration to the agreement between distribution and observation.

Since the response is measured on an ordinal scale an appropriate loss function derived from the continuous ranked probability score (Gneiting and Raftery (2007)) is

L_RPS(Y, π̂) = Σ_{r=1}^{k} ( π̂(r, x) − I(Y ≤ r) )^2,

where (Y, x) is a new observation and π̂(r, x) = π̂_1(x) + · · · + π̂_r(x) is the cumulative probability. It is a sum over quadratic (or Brier) scores for binary data and takes the closeness between the whole distribution and the observed value into account. For alternative measures see also Gneiting and Raftery (2007).

Heart Data
This data set includes 294 patients undergoing angiography at the Hungarian Institute of Cardiology in Budapest between 1983 and 1987, and is included in the R package ordinalForest (Hornung, 2020). It contains ten covariates and one ordinal target variable. Explanatory variables are age (age in years), sex (1 = male; 0 = female), chest pain (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic), trestbps (blood pressure in mm Hg on admission to the hospital), chol (serum cholestoral in mg/dl), fbs (fasting blood sugar >
120 mg/dl, 1 = true; 0 = false), restecg (resting electrocardiographic results, 1 = having ST-T wave abnormality, 0 = normal), thalach (maximum heart rate achieved), exang (exercise induced angina, 1 = yes; 0 = no), oldpeak (ST depression induced by exercise relative to rest). The response is Cat (severity of coronary artery disease determined using angiograms, 1 = no disease; 2 = degree 1; 3 = degree 2; 4 = degree 3; 5 = degree 4).
Birth Weight Data
The lowbwt data set contained in the R package rpartOrdinal has been used in several random forest papers. As categorical response we use the birth weight, obtained by binning the variable bwt according to the cutoffs 2500, 3000, and 3500 (see also Galimberti et al., 2012). Explanatory variables are age (age of mother in years), lwt (weight of mother at last menstrual period in pounds), smoke (smoking status during pregnancy, 1: No, 2: Yes), ht (history of hypertension, 1: No, 2: Yes), ftv (number of physician visits during the first trimester, 1: None, 2: One, 3: Two, etc.).
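The binning step can be sketched as follows; whether the cutoffs are applied with strict or non-strict inequality is not stated in the text, so the direction used here is an assumption:

```python
def bin_ordinal(value, cutoffs):
    """Ordinal category in {1, ..., len(cutoffs) + 1} obtained by
    binning a continuous value at the given cutoffs (assumed here
    to define right-open intervals: category increases once the
    value exceeds a cutoff)."""
    category = 1
    for c in sorted(cutoffs):
        if value > c:
            category += 1
    return category
```

With the cutoffs 2500, 3000, and 3500, a birth weight of 3200 g falls into category 3 under this convention.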
GLES Data
The GLES data stem from the German Longitudinal Election Study (GLES), which is a long-term study of the German electoral process (Rattinger et al., 2014). The data consist of 2036 observations and originate from the pre-election survey for the German federal election in 2017 and are concerned with political fears. In particular, the participants were asked: “How afraid are you due to the use of nuclear energy?” The answers were measured on Likert scales from 1 (not afraid at all) to 7 (very afraid). The explanatory variables in the model are
Abitur (high school leaving certificate, 1: Abitur/A levels; 0: else),
Age (age of the participant),
EastWest (1: East Germany/former GDR; 0: West Germany/former FRG),
Gender (1: female; 0: male),
Unemployment (1: currently unemployed; 0: else).
Safety Data
The package CUB (Iannario et al., 2020) contains the data set relgoods, which provides results of a survey aimed at measuring the subjective extent of feeling safe in the streets. The data were collected in the metropolitan area of Naples, Italy. Every participant was asked to assess on a 10-point ordinal scale his/her personal score for feeling safe, with large categories referring to feeling safe. There are n = 2225 observations and five variables, Age, Gender (0: male, 1: female), the educational degree (
EduDegree; 1: compulsory school, 2: high school diploma, 3: Graduated-Bachelor degree, 4: Graduated-Master degree, 5: Post graduated),
WalkAlone (1 = usually walking alone, 0 = usually walking in company),
Residence (1: City of Naples, 2: District of Naples, 3: Others Campania, 4: Others Italia).
In the following the accuracy of prediction in the data sets described above is investigated. The data sets were split repeatedly into a learning set with sample size n_L and a validation set built from the rest of the data (number of splits: 30). The learning set was used to fit the method under investigation; the accuracy of prediction is then computed in the validation set. We use the ranked probability score, since it indicates the accuracy of prediction better than simple class predictions. In addition, we give, for simplicity, the distance between the predicted class and the true class. The latter measure is an indicator of how far the prediction is from the true class.

The fitting of split-based and adjacent categories random forests can be based on different random forest methods for binary responses. In particular one can use ordinalForest (Hornung, 2020), randomForest (Liaw et al., 2015), or conditional trees as provided by cforest (Hothorn and Zeileis, 2015). Figure 3 shows the averaged ranked probability scores for fitted adjacent categories random forests when using ordinalForest (OrdRF), randomForest (RF), and cforest (CRF) for the housing data and the GLES data. The differences in performance are negligible. Therefore, in the following we use only one method to generate split-based and adjacent categories random forests, namely randomForest, which is computationally quite efficient.

Figure 3:
Ranked probability scores for housing data (left) and GLES data (right) when fitting the adjacent categories random forest with ordinalForest (OrdRF), randomForest (RF), and cforest (CRF).
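The two accuracy measures used in the comparisons can be sketched directly from their definitions (our function names, not code from the paper):

```python
def ranked_probability_score(probs, y):
    """Ranked probability score: sum over r of the squared difference
    between the predicted cumulative probability and I(y <= r)."""
    score, cum = 0.0, 0.0
    for r, p in enumerate(probs, start=1):
        cum += p
        score += (cum - (1.0 if y <= r else 0.0)) ** 2
    return score

def class_distance(y_pred, y_true):
    """Absolute distance between predicted and true category."""
    return abs(y_pred - y_true)
```

A prediction that concentrates all mass on the observed category obtains a ranked probability score of zero; mass placed on categories far from the outcome is penalized more heavily than mass on adjacent categories.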
The methods to be considered in the following are:

• Pom: fitting of a proportional odds model,
• Adj: fitting of an adjacent categories logit model,
• RFord: fitting of an ordinal forest with ordinalForest,
• RFSplit: split-based ordinal random forest using randomForest to fit the binary random forests,
• RFadj: fitting of an adjacent categories random forest using randomForest to fit the binary random forests,
• Ens3: weighted ensemble including the proportional odds model, ordinalForest fit and adjacent categories random forest,
• Ens5: weighted ensemble including the proportional odds model, the adjacent categories model, ordinalForest fit, and adjacent categories and split-based random forest.

The last two methods are ensemble methods that include parametric models.
Ens3 is built from one parametric model, an ordinal random forest, and the adjacent categories random forest, whereas
Ens5 contains in addition the adjacent categories model and the split-based random forest. The ensemble built from three methods serves to demonstrate that it is essential to combine ordinal random forests and parametric models. The inclusion of further models will be shown to improve the performance only slightly.

Figure 4 shows the ranked probability scores (left column) and the distance between true response and predicted response (right column) for the validation data. It is seen that ordinal random forests outperform parametric models for the first three data sets only. A distinct advantage of ordinal forests is seen in particular in the housing data set. For the heart and birth weight data sets the performance of ordinal trees is only slightly superior. For the retinopathy and the GLES data simple parametric models perform much better than ordinal forests. The best performance is seen for the ensemble methods that combine parametric and nonparametric methods. They seem to efficiently combine the best of two worlds, yielding small errors for all data sets. Thus, if one wants to avoid ending up with an inferior prediction tool one should consider not only trees or parametric models but aim at a combination of these methods.

In Figure 4 the split-based and adjacent categories forests utilize the randomForest method to fit the contained binary trees. Very similar performance is found when using alternative methods to fit binary trees, like ctree or ordinalForest. These alternative methods yield different trees.
When investigating single trees the choice of the method definitely makes a difference, and specific trees may offer advantages; for example, conditional trees, which use tests in the splitting procedure, are able to control the significance level and avoid selection bias (Strobl et al., 2007; Hothorn et al., 2006), making them an attractive choice. However, for ensembles of trees such as random forests the performance is very similar, at least in the case of split-based and adjacent categories forests.
[Figure 4 panels, top to bottom: Housing (n_L = 400), Heart (n_L = 200), Birthweight (n_L = 160), Retinopathy (n_L = 500), GLES (n_L = 400); methods compared: Pom, Adj, RFord, RFSplit, RFadj, Ens3, Ens5.]
Figure 4:
Results for data sets; left: ranked probability score, right: distance between true and estimated value.

Importance of Variables
While single trees for split variables are easy to interpret, this does not hold for ensembles of trees. Since variables appear in different trees at different positions, the impact of variables is hard to infer from plots of hundreds of trees. On the other hand, random forests allow for complex effects of predictors, which makes them a flexible prediction tool.

There is a considerable amount of literature that deals with the development of importance measures for random forests, see, for example, Strobl et al. (2007, 2008); Hapfelmeier et al. (2014); Gregorutti et al. (2017); Hothorn and Zeileis (2015). A naive measure simply counts the number of times each variable is selected by the individual trees in the ensemble. Better, more elaborate variable importance measures incorporate a (weighted) mean of the individual trees' improvement in the splitting criterion produced by each variable. An example of such a measure is the "Gini importance" available in the randomForest package. It describes the improvement in the "Gini gain" splitting criterion. Alternative, and better, variable importance measures are based on permutations, yielding so-called permutation accuracy importance measures (Strobl et al., 2007). By randomly permuting a single predictor variable X_j, its original association with the response Y is broken. When the permuted variable X_j, together with the remaining un-permuted predictor variables, is used to predict the response, the prediction accuracy is supposed to decrease if the variable X_j had an additional impact on explaining the response. The difference in prediction accuracy before and after permuting X_j yields a permutation accuracy importance measure.

In the following we use the heart data to illustrate how importance measures can be obtained for split-based and adjacent categories random forests. Of course it depends on the algorithm that is used to grow binary trees which importance measure can be computed.
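The permutation accuracy importance just described can be sketched generically; the predictor is treated as a black box, and the function name and interface are ours:

```python
import random

def permutation_importance(predict, X, y, loss, j, n_perm=20, seed=1):
    """Permutation accuracy importance of predictor j: mean loss
    after randomly permuting column j of X, minus the loss on the
    unpermuted data."""
    rng = random.Random(seed)
    n = len(y)
    base = sum(loss(predict(x), t) for x, t in zip(X, y)) / n
    column = [x[j] for x in X]
    permuted_loss = 0.0
    for _ in range(n_perm):
        perm = column[:]
        rng.shuffle(perm)
        # rebuild the design with only column j permuted
        Xp = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, perm)]
        permuted_loss += sum(loss(predict(x), t)
                             for x, t in zip(Xp, y)) / n
    return permuted_loss / n_perm - base
```

A predictor that ignores variable j obtains importance 0, since permuting that column cannot change its predictions.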
Figure 5 shows the Gini importance when using randomForest to fit the binary random forests. In the upper panels one sees the importance measures obtained for the split variables, that is, for conditional splits in the adjacent categories RF on the left, and direct splits in the split-based RF on the right. The numbers 1 to 4 indicate the splits. For example, 3 refers to the split that separates the categories up to the third from the higher ones. It is seen that the first six variables show strong importance, with the importance being stronger for lower category splits and weaker for higher category splits. The lower panel shows the importance measures averaged across the splits. The lower curves, which are almost identical, show the averages for the adjacent categories and the split-based random forest. The panel shows, in addition, the Gini importance for the multi-category random forest obtained from randomForest. It is seen that the importance measures have the same order for all the fitted random forests. That the values of importance for the multi-category random forest are higher than for the other two forests is merely a scaling effect.

Figure 6 shows the corresponding picture if conditional trees are used (cforest). Conditional trees avoid the bias that is found if categorical variables with varying numbers of categories and a mixture of categorical and continuous predictors are used, see, for example, Strobl et al. (2007). Consequently, the obtained importance measures differ from the Gini importance measures. It is seen that variables 1, 2 and 7 are very influential. In particular, the importance of variable 1, which is a categorical variable, is more distinct than in the Gini importance measures.
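The per-split importances shown in the upper panels, and their averages in the lower panel, can be sketched as follows. This is a minimal illustration assuming scikit-learn's RandomForestClassifier as the binary building block (the paper itself uses the R packages randomForest and cforest); each split k dichotomizes the ordinal response into I(y > k), a binary forest is grown per split, and the Gini importances are averaged across splits.

```python
# Sketch of per-split Gini importances for a split-based ordinal forest,
# using scikit-learn as an illustrative binary building block (assumption).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def split_based_importances(X, y, random_state=0):
    """One binary forest per split I(y > k).
    Returns (per-split importances, importances averaged over splits)."""
    categories = np.sort(np.unique(y))
    per_split = []
    for k in categories[:-1]:                   # splits 1, ..., K-1
        rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
        rf.fit(X, (y > k).astype(int))          # dichotomize at category k
        per_split.append(rf.feature_importances_)
    per_split = np.array(per_split)
    return per_split, per_split.mean(axis=0)    # rows: splits; mean over splits
```

The rows of the first return value correspond to the split-wise curves in the upper panels, the second return value to the averaged curve in the lower panel.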
Figure 5: Gini importance for heart data; variables 1 to 10: chest pain, oldpeak, age, trestbps, chol, thalach, exang, sex, fbs, restecg (randomForest fit); left upper panel: importance for conditional splits in adjacent categories RF, right upper panel: importance for splits in split-based RF, lower panel: averaged importance measures for splits in adjacent categories and split-based RF (lower curves) and multi-categorical fit of randomForest (upper curve).
Figure 6: Importance for heart data; variables 1 to 10: chest pain, oldpeak, age, trestbps, chol, thalach, exang, sex, fbs, restecg (cforest fit); left upper panel: importance for conditional splits in adjacent categories RF, right upper panel: importance for splits in split-based RF, lower panel: averaged importance measures for splits in adjacent categories (lower curve) and split-based RF (upper curve).
The split variables, which are the building blocks of ordinal models, have been used to develop ordinal trees and random forests. The basic concept can also be used to generate alternative parametric or nonparametric classification methods that account for the order in responses. One can, for example, use two-class linear discriminant analysis, binary models with variable selection by the lasso in the case of many predictors, or nonparametric methods such as the nearest neighbor classifier for two classes. All of these methods can be used to model the split variables conditionally or unconditionally. In the present paper we restricted consideration to random forests since the objective was to construct score-free random forests.

The more recently proposed ordinal random forests are also in some way inspired by parametric ordinal models, but in a different way than the split variables approach propagated here. The score-free random forests proposed by Buri and Hothorn (2020) follow a quite different strategy. They fit a cumulative logit model and use the likelihood contributions of the observations to obtain test statistics. The core idea is to regress the obtained partial derivatives of the log-likelihood on prognostic variables. By using the cumulative model the order of categories is exploited without the need for assigned scores. It should be noted, however, that the "pure" cumulative model is fitted in subpopulations without including predictors. The ordinal forest propagated by Hornung (2020) also uses the cumulative logistic model. It exploits the latent continuous response variable underlying the observed ordinal response variable by explicitly using the widths of the adjacent intervals in the range of the continuous response variable. These intervals are considered as corresponding to the classes of the ordinal response variable.
That means, "the ordinal response variable is treated as a continuous variable, where the differing extents of the individual classes of the ordinal response variable are implicitly taken into account" (Hornung, 2020). The approach is closely related to conventional random forests for continuous outcomes but optimizes the assigned scores instead of considering them as given, and therefore is score-free in a certain sense.
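The split-variables construction discussed above reduces, on the prediction side, to combining the exceedance probabilities P(Y > k | x) returned by the binary building blocks into category probabilities. The following sketch illustrates this combination step; the function name and the simple monotonicity repair are assumptions for illustration, not the paper's implementation.

```python
# Sketch of combining binary split models into ordinal class probabilities.
# Any binary classifier that estimates P(Y > k | x) per split k can serve
# as the building block; this only shows the score-free combination step.
import numpy as np

def combine_split_probs(exceed_probs):
    """exceed_probs: array of shape (n, K-1), column k holding P(Y > k+1 | x).
    Returns an (n, K) array of category probabilities P(Y = k | x)."""
    q = np.asarray(exceed_probs, dtype=float)
    # Separately fitted binary models need not respect the monotonicity
    # q_1 >= q_2 >= ...; a simple running-minimum repair enforces it.
    q = np.minimum.accumulate(q, axis=1)
    n = q.shape[0]
    bounds = np.hstack([np.ones((n, 1)), q, np.zeros((n, 1))])
    return bounds[:, :-1] - bounds[:, 1:]   # differences of survivor probabilities
```

For instance, exceedance probabilities (0.8, 0.5, 0.2) over three splits yield the four category probabilities (0.2, 0.3, 0.3, 0.2), which sum to one by construction.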
References
Agresti, A. (2010). Analysis of Ordinal Categorical Data, 2nd Edition. New York: Wiley.
Anderson, J. A. (1984). Regression and ordered categorical variables. Journal of the Royal Statistical Society B 46, 1–30.
Anderson, J. A. and R. R. Phillips (1981). Regression, discrimination and measurement models for ordered categorical variables. Applied Statistics 30, 22–31.
Andrich, D. (2013). An expanded derivation of the threshold structure of the polytomous Rasch model that dispels any 'threshold disorder controversy'. Educational and Psychological Measurement 73(1), 78–124.
Archer, K. J. (2010). rpartOrdinal: an R package for deriving a classification tree for predicting an ordinal response. Journal of Statistical Software 34, 7.
Bender, R. and U. Grouven (1998). Using binary logistic regression models for ordinal data with non-proportional odds. Journal of Clinical Epidemiology 51, 809–816.
Brant, R. (1990). Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics 46, 1171–1178.
Breiman, L. (1996). Bagging predictors. Machine Learning 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
Bühlmann, P., B. Yu, et al. (2002). Analyzing bagging. The Annals of Statistics 30(4), 927–961.
Buri, M. and T. Hothorn (2020). Model-based random forests for ordinal regression. The International Journal of Biostatistics 1 (ahead-of-print).
Campbell, M. K. and A. P. Donner (1989). Classification efficiency of multinomial logistic-regression relative to ordinal logistic-regression. Journal of the American Statistical Association 84(406), 587–591.
Campbell, M. K., A. P. Donner, and K. M. Webster (1991). Are ordinal models useful for classification? Statistics in Medicine 10, 383–394.
Chu, W. and S. S. Keerthi (2007). Support vector ordinal regression. Neural Computation 19(3), 792–815.
Cox, C. (1995). Location-scale cumulative odds models for ordinal data: A generalized non-linear model approach. Statistics in Medicine 14, 1191–1203.
Fernandez, D., I. Liu, and R. Costilla (2019). A method for ordinal outcomes: The ordered stereotype model. International Journal of Methods in Psychiatric Research, e1801.
Galimberti, G., G. Soffritti, and M. Di Maso (2012). Classification trees for ordinal responses in R: the rpartScore package. Journal of Statistical Software 47.
Gneiting, T. and A. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–376.
Goodman, L. A. (1981a). Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association 76, 320–334.
Goodman, L. A. (1981b). Association models and the bivariate normal for contingency tables with ordered categories. Biometrika 68, 347–355.
Greenland, S. (1994). Alternative models for ordinal logistic regression. Statistics in Medicine 13, 1665–1677.
Gregorutti, B., B. Michel, and P. Saint-Pierre (2017). Correlation and variable importance in random forests. Statistics and Computing 27(3), 659–678.
Hapfelmeier, A., T. Hothorn, K. Ulm, and C. Strobl (2014). A new variable importance measure for random forests with missing data. Statistics and Computing 24(1), 21–34.
Hornung, R. (2020). Ordinal forests. Journal of Classification 37, 4–17.
Hothorn, T., K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15, 651–674.
Hothorn, T. and A. Zeileis (2015). partykit: A modular toolkit for recursive partytioning in R. The Journal of Machine Learning Research 16(1), 3905–3909.
Iannario, M., D. Piccolo, and R. Simone (2020). CUB: a class of mixture models for ordinal data. R package version 1.1.4, http://cran.r-project.org/package=cub.
Janitza, S., G. Tutz, and A.-L. Boulesteix (2016). Random forest for ordinal responses: prediction and variable selection. Computational Statistics & Data Analysis 96, 57–73.
Kateri, M. (2014). Contingency Table Analysis. Springer.
Kim, J.-H. (2003). Assessing practical significance of the proportional odds assumption. Statistics & Probability Letters 65(3), 233–239.
Liaw, A., M. Wiener, L. Breiman, and A. Cutler (2015). Package randomForest.
Liu, I., B. Mukherjee, T. Suesse, D. Sparrow, and S. K. Park (2009). Graphical diagnostics to check model misspecification for the proportional odds regression model. Statistics in Medicine 28(3), 412–429.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149–174.
Masters, G. N. and B. Wright (1984). The essential process in a family of measurement models. Psychometrika 49, 529–544.
McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society B 42, 109–127.
Muraki, E. (1997). A generalized partial credit model. Handbook of Modern Item Response Theory, 153–164.
Peterson, B. and F. E. Harrell (1990). Partial proportional odds models for ordinal response variables. Applied Statistics 39, 205–217.
Rattinger, H., S. Roßteutscher, R. Schmitt-Beck, B. Weßels, and C. Wolf (2014). Pre-election cross section (GLES 2013). GESIS Data Archive, Cologne ZA5700 Data file Version 2.0.0.
Rudolfer, S. M., P. C. Watson, and E. Lesaffre (1995). Are ordinal models useful for classification? A revised analysis. Journal of Statistical Computation and Simulation 52(2), 105–132.
Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics 9(1), 307.
Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1), 25.
Tutz, G. (2012). Regression for Categorical Data. Cambridge University Press.
Tutz, G. (2021). Ordinal regression: a review and a taxonomy of models.