Ordinal Trees and Random Forests: Score-Free Recursive Partitioning and Improved Ensembles
Gerhard Tutz
Ludwig-Maximilians-Universität München, Akademiestraße 1, 80799 München
February 2, 2021
Abstract
Existing ordinal trees and random forests typically use scores that are assigned to the ordered categories, which implies that a higher scale level is used. Versions of ordinal trees are proposed that take the scale level seriously and avoid the assignment of artificial scores. The basic construction principle is based on an investigation of the binary models that are implicitly used in parametric ordinal regression. These building blocks can be fitted by trees and combined in a similar way as in parametric models. The obtained trees use the ordinal scale level only. Since binary trees and random forests are constituent elements of the trees one can exploit the wide range of binary trees that have already been developed. A further topic is the potentially poor performance of random forests, which seems to have been ignored in the literature. Ensembles that include parametric models are proposed to obtain prediction methods that tend to perform well in a wide range of settings. The performance of the methods is evaluated empirically by using several data sets.
Keywords: recursive partitioning; trees; random forests; ensemble methods; ordinal regression
There is a long tradition of analyzing ordinal response data by using parametric models, which started with the seminal paper of McCullagh (1980). More recently, recursive partitioning methods have been developed that allow one to investigate the impact of explanatory variables by non-parametric tools. Single trees and random forests for ordinal responses have several advantages: they can be applied to large data sets and are considered to perform very well in prediction. A problem with most of the ordinal trees is that they assume that scores are assigned to the ordered categories of the response. The assignment of scores can be warranted in some cases, in particular if ordinal responses are built from continuous variables by grouping. However, it is rather artificial and arbitrary in genuine ordinal response data, for example, if the response represents ordered levels of severeness of a disease. Then one cannot choose the midpoints of the intervals from which the ordered response is built as suggested by Hothorn et al. (2006) since no continuous variable is observed. If nevertheless scores are assigned they can affect the prediction results, although that does not always have to be the case, see also Janitza et al. (2016). The packages rpartOrdinal (Archer, 2010) as well as the improved version rpartScore (Galimberti et al., 2012), which are based on the Gini impurity function, use assigned scores. The same holds for the random forests proposed by Janitza et al. (2016) and the ordinal version of conditional trees of the package party (Hothorn et al., 2006; Hothorn and Zeileis, 2015). The random forest approach proposed by Hornung (2020) is somewhat different; it also translates ordinal measurements into continuous scores but optimizes scores instead of using a fixed score. Versions of random forests without scores were proposed more recently by Buri and Hothorn (2020).
They use the ordinal proportional odds model to obtain statistics that are used in splitting.

In the following, alternative trees and random forests that take the scale level of the response seriously are proposed. The main concept is that ordinal responses contain binary responses as building blocks. This has already been implicitly used in parametric modeling approaches. For example, the widely used proportional odds model can be seen as a model that parameterizes the split of response categories into two groups of adjacent categories. But the principle also holds for alternative models such as the adjacent categories model and the sequential model; see Tutz (2021) for an overview and a taxonomy of ordinal regression models. The proposed trees explicitly use the representation of ordinal responses as a set of binary variables. Random forests for the binary variables are used to obtain random forests for ordinal response data.

For random forests it is important that they provide good performance in terms of prediction. They are commonly considered as being very efficient. However, as will be demonstrated, this does not hold in general. In many cases simple parametric models turn out to be at least as efficient and sometimes more efficient than the carefully designed random forests. Typically, when ordinal forests are propagated the accuracy is investigated for versions of random forests only, but they are not compared to parametric competitors. In the following we propagate the use of ensembles that include parametric models to provide a stable prediction tool that works well in all kinds of data sets.

The paper has two objectives: introducing score-free recursive partitioning and random forests, and propagating ensembles that include parametric models. In Section 2 the representation of ordinal responses as a sequence of binary responses is briefly considered. It makes clear that specific binary responses can be seen as building blocks of classical parametric models.
In Section 3 it is shown how these building blocks can be used to construct score-free trees and random forests. In addition, more general ensembles are considered. In Section 4 the performance of the ensembles is investigated by using real data sets. Section 5 is devoted to importance measures, which are an essential ingredient of random forests since the impact of variables on prediction in random forests is not directly available.

In the following the representation of an ordinal response as a collection of binary responses is considered. It can be seen as being behind the construction of parametric ordinal models and will serve to construct a novel type of recursive partitioning that does not use assigned scores.

Let the ordinal response Y take values from {1, . . . , k}. Although these values suggest a univariate response, the actual response is multivariate since the numbers 1, . . . , k just represent that outcomes are ordered; distances between numbers assigned to categories should not be built in an ordinal scale because they are not interpretable.

A multivariate representation of the outcome can be obtained by using binary dummy variables. Natural candidates for dummy variables are the split variables

Y_r = 1 if Y ≥ r,  Y_r = 0 if Y < r.    (1)

Then Y = r is represented by a sequence of ones followed by a sequence of zeros, (Y_2, . . . , Y_k) = (1, . . . , 1, 0, . . . , 0). The vector (Y_2, . . . , Y_k) can be seen as a multivariate representation of the response. The dummy variables that generate vectors of this form, which are characterized by a sequence of ones followed by a sequence of zeros, have also been referred to as Guttman variables (Andrich, 2013).

Classical ordinal regression models use these dummy variables but are most often derived from the assumption of an underlying continuous variable, and the link to split variables is ignored.
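As a minimal illustration (not code from the paper; the function name is ours), the split-variable encoding in (1) can be sketched as follows:

```python
def split_variables(y, k):
    """Encode an ordinal response y in {1, ..., k} as the split
    variables (Y_2, ..., Y_k) of (1), where Y_r = 1 iff y >= r.
    The result is a Guttman vector: ones followed by zeros."""
    return [1 if y >= r else 0 for r in range(2, k + 1)]
```

For example, with k = 5 the response y = 3 is encoded as (1, 1, 0, 0).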
The most widely used proportional odds model, also called cumulative logistic model, has the form

P(Y ≥ r | x) = F(β_r + x^T β),  r = 2, . . . , k,    (2)

where x is a vector of explanatory variables and F(η) = exp(η)/(1 + exp(η)) is the logistic distribution function. For the parameters one has the restriction β_2 ≥ · · · ≥ β_k. The model explicitly uses the dichotomizations given by (1). Since Y ≥ r iff Y_r = 1 the model can also be given as

P(Y_r = 1 | x) = F(β_r + x^T β),  r = 2, . . . , k.    (3)

Thus, the proportional odds model is equivalent to a collection of binary logit models that have to hold simultaneously. The model implies that the effect of covariates contained in x^T β is the same for all dichotomizations. That means if one fits the binary models (3) separately one should obtain similar values for estimates of β. This restriction can be weakened by using the partial proportional odds model, in which the effect of variables may depend on the category, that is, the linear term x^T β in (2) is replaced by x^T β_r.

Model (2) is a so-called cumulative model since on the left hand side one has the sum of probabilities P(Y ≥ r | x). Cumulative models form a whole family of models, whose members are characterized by the choice of a specific strictly increasing distribution function F(·). They have been investigated and extended, among others, by McCullagh (1980), Brant (1990), Peterson and Harrell (1990), Bender and Grouven (1998), Cox (1995), Kim (2003) and Liu et al. (2009).

An alternative ordinal regression model is the adjacent categories model, which has the basic form

P(Y ≥ r | Y ∈ {r − 1, r}, x) = F(β_r + x^T β),  r = 2, . . . , k.    (4)

Since P(Y ≥ r | Y ∈ {r − 1, r}, x) = P(Y = r | Y ∈ {r − 1, r}, x) it specifies the probability of observing category r given the response is in categories {r − 1, r}. Because of the conditioning it can be seen as a local model. The interesting point is that it also uses the split variables.
It is easily seen that it is equivalent to

P(Y_r = 1 | Y_{r−1} = 1, Y_{r+1} = 0, x) = F(β_r + x^T β),  r = 2, . . . , k.    (5)

Thus, it specifies the binary response variable Y_r conditionally, in contrast to cumulative models, which determine the binary response directly in an unconditional way. But as for cumulative models it is assumed that the binary models (5) hold simultaneously.

The adjacent categories logit model may also be considered as the regression model that is obtained from the row-column (RC) association model considered by Goodman (1981a,b), Kateri (2014). It is also related to Anderson's stereotype model (Anderson, 1984), which was considered by Greenland (1994) and Fernandez et al. (2019). It has been most widely used as a latent trait model in the form of the partial credit model (Masters, 1982; Masters and Wright, 1984; Muraki, 1997).

An advantage of the adjacent categories model is that one can replace the parameter vector β by a category-specific parameter vector β_r without running into problems. In cumulative models one has the restriction P(Y ≥ 2 | x) ≥ . . . ≥ P(Y ≥ k | x), which can yield problems, in particular when fitting the binary models (3) with category-specific parameter vectors β_r. For overviews of parametric ordinal models see, for example, Agresti (2010), Tutz (2012). They also include a third type of ordinal model, the sequential model, which is a specific process model, which could also be extended to tree type models. But because of its specific nature we do not consider it explicitly.

The main point is that binary models are at the core of parametric ordinal models. There is a good reason for that because the splits represent the order in categories without assuming more than an order of categories. In the following this is exploited to construct trees that account for the ordering of categories.
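To make the binary building blocks concrete, the following sketch (our function names, not code from the paper) computes category probabilities of a cumulative logit model from the binary representations (3), with thresholds assumed to satisfy the ordering restriction:

```python
import math

def logistic(z):
    """F(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(thresholds, eta):
    """Category probabilities of a cumulative logit model.
    thresholds = [beta_2, ..., beta_k], assumed non-increasing;
    eta = x' beta.  P(Y >= r | x) = F(beta_r + eta) for r = 2..k,
    and P(Y = r | x) is the difference of adjacent cumulative
    probabilities, with P(Y >= 1) = 1 and P(Y >= k+1) = 0."""
    cum = [1.0] + [logistic(b + eta) for b in thresholds] + [0.0]
    return [cum[r] - cum[r + 1] for r in range(len(cum) - 1)]
```

Because the thresholds are ordered, the cumulative probabilities decrease in r and the differences form a proper distribution over the k categories.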
The crucial role of split variables in modeling ordered responses can be used to obtain non-parametric tree models that use the ordering efficiently. There are basically two ways to do so: one is to use the split variables directly, which corresponds to cumulative type models; the other approach is to use them conditionally, which corresponds to the adjacent categories approach.
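The two construction principles amount to two ways of deriving binary training sets from the ordinal response. A minimal sketch (our function names; the real fitting is done by binary tree software):

```python
def split_based_targets(ys, r):
    """Unconditional split at r: every observation is kept,
    with binary target I(Y >= r) (cumulative-type approach)."""
    return [(i, 1 if y >= r else 0) for i, y in enumerate(ys)]

def adjacent_targets(ys, r):
    """Conditional split at r: only observations with Y in {r-1, r}
    are kept, with binary target I(Y = r) (adjacent categories
    approach)."""
    return [(i, 1 if y == r else 0) for i, y in enumerate(ys)
            if y in (r - 1, r)]
```

With ys = [1, 2, 3, 2], split_based_targets(ys, 2) uses all four observations, while adjacent_targets(ys, 3) keeps only the observations with Y in {2, 3}.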
Split variables are binary and therefore binary trees can be fitted. Let the tree for Y_r be given by

log [ P(Y_r = 1 | x) / P(Y_r = 0 | x) ] = tr_r(x),  r = 2, . . . , k,    (6)

where tr_r(x) denotes the partitioning of the predictor space, that is, the tree. Then one obtains for the probabilities

P(Y_r = 1 | x) = P(Y ≥ r | x) = exp(tr_r(x)) / (1 + exp(tr_r(x))).

The corresponding trees are called split-based trees. Split variables are a formal tool to group categories but have substantial meaning in many applications. For example, in the retinopathy data set (Bender and Grouven, 1998), which will also be considered later, the response categories are 1: no retinopathy, 2: nonproliferative retinopathy, 3: advanced retinopathy or blind. Thus the split between categories {1} and {2, 3} distinguishes between healthy and not healthy, whereas the split between {1, 2} and {3} distinguishes between serious illness and otherwise. It is crucial that explanatory variables may play different roles for different splits. In the retinopathy data set, with explanatory variables smoking (SM = 1: smoker, SM = 0: non-smoker), diabetes duration (DIAB) measured in years, glycosylated hemoglobin (GH), measured in percent, and diastolic blood pressure (BP) measured in mmHg, one obtains for the two splits the trees shown in Figure 1 (fitted by using ctree, Hothorn et al. (2006)). It is seen that the trees are quite different, which means that explanatory variables play differing roles when used to distinguish between healthy and not healthy and between serious illness and less serious illness.

Figure 1:
Conditional trees for retinopathy data, upper panel: split between {1} and {2, 3}, lower panel: split between {1, 2} and {3}.

Instead of the unconditional split variables considered previously let us consider the conditional binary variables

Ỹ_r = 1 if Y ≥ r,  Ỹ_r = 0 if Y < r,  given Y ∈ {r − 1, r},    (7)

r = 2, . . . , k. The variables are conditional versions of split variables. More concretely, Ỹ_r represents Y_r | Y_{r−1} = 1, Y_{r+1} = 0. The main difference between Ỹ_r and Y_r is that the former is a conditional variable. This is important since fitting a tree to Ỹ_r means one includes only observations with Y ∈ {r − 1, r}. The corresponding tree can be seen as a nonparametric version of the adjacent categories model and is called an adjacent categories tree. The corresponding trees are local; they reflect the impact of explanatory variables on the distinction between adjacent categories.

Adjacent categories trees have a different interpretation than trees for split variables. For illustration, Figure 2 shows the fitted trees for the retinopathy data. It is seen that diabetes duration (DIAB) has an impact in both trees. In the split between categories 1 and 2 the only other variable that is significant is glycosylated hemoglobin, while in the split between categories 2 and 3 it is blood pressure. The trees are smaller than split-based trees since due to conditioning the number of observations is smaller. From a substantial point of view it might be most interesting to combine trees from the different splitting concepts. The first tree in Figure 1 distinguishes between {1} and {2, 3}, that is, between healthy and non healthy. The second tree in Figure 2 shows which variables are significant when distinguishing between categories 2 and 3 given the response is in categories {2, 3}, that is, which variables are influential given the patient suffers from retinopathy.

Figure 2:
Conditional trees for retinopathy data, left: split given Y ∈ {1, 2}, right: split given Y ∈ {2, 3}.

Single trees can be informative for researchers who want to investigate which variables have an impact on specific dichotomizations. If one has prediction in mind a better choice are random forests, which are much more stable and efficient than single trees (Breiman, 1996, 2001; Bühlmann et al., 2002). Then it is necessary to combine the results of single trees in a proper way.

Let us first consider split-based trees. They face the problem, familiar from cumulative models with category-specific effects, that specific constraints have to be fulfilled. More specifically, for all values of x the constraint P(Y ≥ 2 | x) ≥ · · · ≥ P(Y ≥ k | x) has to hold, which is equivalent to P(Y_2 = 1 | x) ≥ · · · ≥ P(Y_k = 1 | x). However, for separately fitted trees the corresponding condition tr_2(x) ≥ · · · ≥ tr_k(x) does not necessarily hold. The same problem occurs in the partial proportional odds model, for which β_2 + x^T β_2 ≥ · · · ≥ β_k + x^T β_k has to hold.

Let π̂(x)_(r) = P̂(Y ≥ r | x) denote the estimated cumulative probabilities resulting from the tree for the split variable Y_r. Then probabilities are obtained by P̂(Y = r | x) = π̂(x)_(r) − π̂(x)_(r+1) if π̂(x)_(r) ≥ π̂(x)_(r+1) for all r. If the latter condition does not hold, the cumulative probabilities π̂(x)_(2), . . . , π̂(x)_(k) are fitted to be decreasing by using monotone regression tools. Alternative approaches to obtain compatible estimators have been considered in the machine learning community, for example, by Chu and Keerthi (2007).

An advantage of adjacent categories trees is that no monotonization tools are needed since estimated probabilities are always compatible. Let the adjacent categories trees be given by

log [ P(Ỹ_r = 1 | x) / P(Ỹ_r = 0 | x) ] = tr̃_r(x),  r = 2, . . . , k,    (8)

where tr̃_r(x) denotes the partitioning of the predictor space.
It is not hard to derive that the probability of a response in category r, given the representation (8) holds, has the form

P(Y = r | x) = exp( Σ_{s=2}^{r} tr̃_s(x) ) / Σ_{s=1}^{k} exp( Σ_{l=2}^{s} tr̃_l(x) ),    (9)

where the empty sum Σ_{l=2}^{1} tr̃_l(x) = 0. The representation (9) holds for any values of tr̃_2(x), . . . , tr̃_k(x); no specific restriction has to be fulfilled.

Random forests are obtained by combining not only the trees for split variables but averaging over a multitude of trees generated by randomization. The approach proposed here exploits the role of the split variables as building blocks for ordinal responses, and can be seen as a split variables based approach, which is unconditional in split-based trees and conditional in adjacent categories trees.

Before investigating the proposed random forests in detail let us point out a problem with ordinal trees that is often ignored. Most presentations of ordinal trees focus on the development of novel trees but do not compare the performance of random forests to the performance of simple parametric models such as the proportional odds model. That leaves the impression that random forests are the most efficient tools. As will be demonstrated in the following sections, parametric models should not be ignored; in many applications they can perform as well as random forests or even better. The use of parametric ordinal models for the prediction of ordinal responses has some tradition, see, for example, Rudolfer et al. (1995), Campbell and Donner (1989), Campbell et al. (1991), Anderson and Phillips (1981).

Random forests themselves are ensemble methods that combine various trees to obtain a good approximation of the underlying response probabilities. To exploit the potential strength of parametric models we propose an ensemble that includes these models.
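A direct transcription of (9) can be sketched as follows, assuming fitted adjacent categories scores tr̃_2(x), . . . , tr̃_k(x) are available (our function name, not code from the paper):

```python
import math

def adjacent_category_probs(scores):
    """Category probabilities (9) from adjacent-categories tree
    scores [tr_2(x), ..., tr_k(x)]; the empty sum for r = 1 is 0,
    so no ordering restriction on the scores is needed."""
    partial = [0.0]                      # sum_{s=2}^{r} tr_s(x)
    for t in scores:
        partial.append(partial[-1] + t)
    weights = [math.exp(p) for p in partial]
    total = sum(weights)
    return [w / total for w in weights]
```

Any real-valued scores yield a proper distribution over the k categories, which is exactly the advantage over split-based trees, where monotonization may be required.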
When estimating response probabilities we will use the ensemble

P̂(Y = r | x) = Σ_{j=1}^{M} w_j P̂_j(Y = r | x),

where P̂_j(Y = r | x) are estimated probabilities for the j-th learner. Learners can be random forests but also parametric models. The weights w_j are chosen according to the prediction performance of the j-th learner. The ensemble efficiently uses different types of learners. By combining them it yields more stable predictions than single learners and automatically gives more weight to the best learner in the ensemble.

Typically, in classification, predictions of single trees from an ensemble are combined by voting. Each subject with given values of the predictor is dropped through every tree such that each single tree returns a predicted class. The prediction of the ensemble is the class most trees voted for. One obtains a majority vote, which has also been called a committee method. It should be noted that the ensembles proposed here combine probabilities. They are not ensembles that use majority votes to combine class predictions obtained for each single learner. By computing the predicted class probabilities one can use more general accuracy measures that also take into account the precision of the prediction. Moreover, with ordinal responses it is sensible not to use the mode of the class probabilities but the median computed over the predicted probabilities as predicted class. The use of estimated probabilities is of crucial importance. We also considered majority votes that combine the votes on splits but the results were distinctly inferior to using probabilities.

One way to investigate the power of a model is to investigate its ability to predict future observations. In discriminant analysis one often uses class prediction as a measure of performance. Class prediction in the considered framework comes in two forms.
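The weighted combination of probabilities and the median-based class prediction described above can be sketched as follows (our function names; the weights are assumed given, e.g. derived from validation performance):

```python
def ensemble_probs(prob_lists, weights):
    """Weighted average of the learners' class-probability vectors,
    with weights normalized to sum to one."""
    total = float(sum(weights))
    k = len(prob_lists[0])
    return [sum(w * p[r] for w, p in zip(weights, prob_lists)) / total
            for r in range(k)]

def median_class(probs):
    """Median of the predictive distribution: the smallest category
    whose cumulative probability reaches 0.5; unlike the mode it
    respects the ordering of the categories."""
    cum = 0.0
    for r, p in enumerate(probs, start=1):
        cum += p
        if cum >= 0.5:
            return r
    return len(probs)
```

Note that the combination happens on the probability scale, not by majority voting over class predictions.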
As predicted class one may use the mode of the response, Ŷ = mod(x), which is in accordance with the Bayes prediction rule, or the median Ŷ = med(x), which makes use of the ordering of categories. Then for a new observation (Y, x) one typically considers the 0-1 loss function

L(Y, Ŷ) = I(Y ≠ Ŷ),

where I(·) is the indicator function. One obtains 1 if the prediction is wrong, and 0 if the prediction is correct. The average over new observations yields the 0-1 error rate.

Rather than giving just one value as a predictor for the class it is more appropriate to consider the whole vector π̂^T(x) = (π̂_1(x), . . . , π̂_k(x)), where π̂_r(x) = P̂(Y = r | x) is the probability one obtains after fitting a tree. The vector π̂(x) represents the predictive distribution. As Gneiting and Raftery (2007) postulated, a desirable predictive distribution should be as sharp as possible and well calibrated. Sharpness refers to the concentration of the distribution and calibration to the agreement between distribution and observation.

Since the response is measured on an ordinal scale an appropriate loss function derived from the continuous ranked probability score (Gneiting and Raftery (2007)) is

L_RPS(Y, π̂) = Σ_{r=1}^{k} ( π̂(r, x) − I(Y ≤ r) )^2,

where (Y, x) is a new observation and π̂(r, x) = π̂_1(x) + · · · + π̂_r(x) is the cumulative probability. It is a sum over quadratic (or Brier) scores for binary data and takes the closeness between the whole distribution and the observed value into account. For alternative measures see also Gneiting and Raftery (2007).

Heart Data
This data set includes 294 patients undergoing angiography at the Hungarian Institute of Cardiology in Budapest between 1983 and 1987, and is included in the R package ordinalForest (Hornung, 2020). It contains ten covariates and one ordinal target variable. Explanatory variables are age (age in years), sex (1 = male; 0 = female), chest pain (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic), trestbps (blood pressure in mm Hg on admission to the hospital), chol (serum cholestoral in mg/dl), fbs (fasting blood sugar >
120 mg/dl, 1 = true; 0 = false), restecg (resting electrocardiographic results, 1 = having ST-T wave abnormality, 0 = normal), thalach (maximum heart rate achieved), exang (exercise induced angina, 1 = yes; 0 = no), oldpeak (ST depression induced by exercise relative to rest). The response is Cat (severity of coronary artery disease determined using angiograms, 1 = no disease; 2 = degree 1; 3 = degree 2; 4 = degree 3; 5 = degree 4).
Birth Weight Data
The lowbwt data set contained in the R package rpartOrdinal has been used in several random forest papers. As categorical response we use the birth weight, obtained by binning the variable bwt according to the cutoffs 2500, 3000, and 3500 (see also Galimberti et al., 2012). Explanatory variables are age (age of mother in years), lwt (weight of mother at last menstrual period in pounds), smoke (smoking status during pregnancy, 1: No, 2: Yes), ht (history of hypertension, 1: No, 2: Yes), ftv (number of physician visits during the first trimester, 1: None, 2: One, 3: Two, etc.).
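The binning step can be sketched as follows; whether the cutoffs are applied with strict or non-strict inequality is not stated in the text, so the direction used here is an assumption:

```python
def bin_ordinal(value, cutoffs):
    """Ordinal category in {1, ..., len(cutoffs) + 1} obtained by
    binning a continuous value at the given cutoffs (assumed here
    to define right-open intervals: category increases once the
    value exceeds a cutoff)."""
    category = 1
    for c in sorted(cutoffs):
        if value > c:
            category += 1
    return category
```

With the cutoffs 2500, 3000, and 3500, a birth weight of 3200 g falls into category 3 under this convention.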
GLES Data
The GLES data stem from the German Longitudinal Election Study (GLES), which is a long-term study of the German electoral process (Rattinger et al., 2014). The data consist of 2036 observations and originate from the pre-election survey for the German federal election in 2017 and are concerned with political fears. In particular, the participants were asked: “How afraid are you due to the use of nuclear energy?” The answers were measured on Likert scales from 1 (not afraid at all) to 7 (very afraid). The explanatory variables in the model are
Abitur (high school leaving certificate, 1: Abitur/A levels; 0: else),
Age (age of the participant),
EastWest (1: East Germany/former GDR; 0: West Germany/former FRG),
Gender (1: female; 0: male),
Unemployment (1: currently unemployed; 0: else).
Safety Data
The package CUB (Iannario et al., 2020) contains the data set relgoods, which provides results of a survey aimed at measuring the subjective extent of feeling safe in the streets. The data were collected in the metropolitan area of Naples, Italy. Every participant was asked to assess on a 10-point ordinal scale his/her personal score for feeling safe, with large categories referring to feeling safe. There are n = 2225 observations and five variables, Age, Gender (0: male, 1: female), the educational degree (
EduDegree; 1: compulsory school, 2: high school diploma, 3: Graduated-Bachelor degree, 4: Graduated-Master degree, 5: Post graduated),
WalkAlone (1 = usually walking alone, 0 = usually walking in company),
Residence (1: City of Naples, 2: District of Naples, 3: Others Campania, 4: Others Italia).
In the following the accuracy of prediction in the data sets described above is investigated. The data sets were split repeatedly into a learning set with sample size n_L and a validation set built from the rest of the data (number of splits: 30). The learning set was used to fit the method under investigation; the accuracy of prediction is then computed in the validation set. We use the ranked probability score, since it indicates the accuracy of prediction better than simple class predictions. In addition, we give, for simplicity, the distance between the predicted class and the true class. The latter measure is an indicator of how far the prediction is from the true class.

The fitting of split-based and adjacent categories random forests can be based on different random forest methods for binary responses. In particular one can use ordinalForest (Hornung, 2020), randomForest (Liaw et al., 2015), or conditional trees as provided by cforest (Hothorn and Zeileis, 2015). Figure 3 shows the averaged ranked probability scores for fitted adjacent categories random forests when using ordinalForest (OrdRF), randomForest (RF), and cforest (CRF) for the housing data and the GLES data. The differences in performance are negligible. Therefore, in the following we use only one method to generate split-based and adjacent categories random forests, namely randomForest, which is computationally quite efficient.

Figure 3:
Ranked probability scores for housing data (left) and GLES data (right) when fitting the adjacent categories random forest with ordinalForest (OrdRF), randomForest (RF), and cforest (CRF).
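The two accuracy measures used in the comparisons can be sketched directly from their definitions (our function names, not code from the paper):

```python
def ranked_probability_score(probs, y):
    """Ranked probability score: sum over r of the squared difference
    between the predicted cumulative probability and I(y <= r)."""
    score, cum = 0.0, 0.0
    for r, p in enumerate(probs, start=1):
        cum += p
        score += (cum - (1.0 if y <= r else 0.0)) ** 2
    return score

def class_distance(y_pred, y_true):
    """Absolute distance between predicted and true category."""
    return abs(y_pred - y_true)
```

A prediction that concentrates all mass on the observed category obtains a ranked probability score of zero; mass placed on categories far from the outcome is penalized more heavily than mass on adjacent categories.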
The methods to be considered in the following are:

• Pom: fitting of a proportional odds model,
• Adj: fitting of an adjacent categories logit model,
• RFord: fitting of an ordinal forest with ordinalForest,
• RFSplit: split-based ordinal random forest using randomForest to fit the binary random forests,
• RFadj: fitting of an adjacent categories random forest using randomForest to fit the binary random forests,
• Ens3: weighted ensemble including the proportional odds model, ordinalForest fit and adjacent categories random forest,
• Ens5: weighted ensemble including the proportional odds model, the adjacent categories model, ordinalForest fit, and adjacent categories and split-based random forest.

The last two methods are ensemble methods that include parametric models.
Ens3 is built from one parametric model, an ordinal random forest, and the adjacent categories random forest, whereas
Ens5 contains in addition the adjacent categories model and the split-based random forest. The ensemble built from three methods serves to demonstrate that it is essential to combine ordinal random forests and parametric models. The inclusion of further models will be shown to improve the performance only slightly.

Figure 4 shows the ranked probability scores (left column) and the distance between true response and predicted response (right column) for the validation data. It is seen that ordinal random forests outperform parametric models for the first three data sets only. A distinct advantage of ordinal forests is seen in particular in the housing data set. For the heart and birth weight data sets the performance of ordinal trees is only slightly superior. For the retinopathy and the GLES data simple parametric models perform much better than ordinal forests. The best performance is seen for the ensemble methods that combine parametric and nonparametric methods. They seem to efficiently combine the best of two worlds, yielding small errors for all data sets. Thus, if one wants to avoid ending up with an inferior prediction tool one should consider not only trees or parametric models but aim at a combination of these methods.

In Figure 4 the split-based and adjacent categories forests utilize the randomForest method to fit the contained binary trees. Very similar performance is found when using alternative methods to fit binary trees, like ctree or ordinalForest. These alternative methods yield different trees.
When investigating single trees the choice of the method definitely makes a difference, and specific trees may offer advantages; for example, conditional trees, which use tests in the splitting procedure, are able to control the significance level and avoid selection bias (Strobl et al., 2007; Hothorn et al., 2006), making them an attractive choice. However, for ensembles of trees such as random forests the performance is very similar, at least in the case of split-based and adjacent categories forests.
[Figure 4 panels, top to bottom: Housing (n_L = 400), Heart (n_L = 200), Birthweight (n_L = 160), Retinopathy (n_L = 500), GLES (n_L = 400); methods compared: Pom, Adj, RFord, RFSplit, RFadj, Ens3, Ens5.]
Figure 4:
Results for data sets; left: ranked probability score, right: distance between true and estimated value.

Importance of Variables
While single trees for split variables are easy to interpret, this does not hold for ensembles of trees. Since variables appear in different trees at different positions, the impact of variables is hard to infer from plots of hundreds of trees. On the other hand, random forests allow for complex effects of predictors, which makes them a flexible prediction tool.

There is a considerable amount of literature that deals with the development of importance measures for random forests, see, for example, Strobl et al. (2007, 2008); Hapfelmeier et al. (2014); Gregorutti et al. (2017); Hothorn and Zeileis (2015). A naive measure simply counts the number of times each variable is selected by the individual trees in the ensemble. Better, more elaborate variable importance measures incorporate a (weighted) mean of the individual trees' improvement in the splitting criterion produced by each variable. An example of such a measure is the "Gini importance" available in the randomForest package. It describes the improvement in the "Gini gain" splitting criterion. Alternative, and better, variable importance measures are based on permutations, yielding so-called permutation accuracy importance measures (Strobl et al., 2007). By randomly permuting a single predictor variable X_j, its original association with the response Y is broken. When the permuted variable X_j, together with the remaining un-permuted predictor variables, is used to predict the response, the prediction accuracy is supposed to decrease if the variable X_j had an additional impact on explaining the response. The difference in prediction accuracy before and after permuting X_j yields a permutation accuracy importance measure.

In the following we use the heart data to illustrate how importance measures can be obtained for split-based and adjacent categories random forests. Of course it depends on the algorithm that is used to grow binary trees which importance measure can be computed.
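The permutation accuracy importance just described can be sketched generically; the predictor is treated as a black box, and the function name and interface are ours:

```python
import random

def permutation_importance(predict, X, y, loss, j, n_perm=20, seed=1):
    """Permutation accuracy importance of predictor j: mean loss
    after randomly permuting column j of X, minus the loss on the
    unpermuted data."""
    rng = random.Random(seed)
    n = len(y)
    base = sum(loss(predict(x), t) for x, t in zip(X, y)) / n
    column = [x[j] for x in X]
    permuted_loss = 0.0
    for _ in range(n_perm):
        perm = column[:]
        rng.shuffle(perm)
        # rebuild the design with only column j permuted
        Xp = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, perm)]
        permuted_loss += sum(loss(predict(x), t)
                             for x, t in zip(Xp, y)) / n
    return permuted_loss / n_perm - base
```

A predictor that ignores variable j obtains importance 0, since permuting that column cannot change its predictions.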
Figure 5 shows the Gini importance when using randomForest to fit the binary random forests. In the upper panels one sees the importance measures obtained for the split variables, that is, for conditional splits in the adjacent categories RF on the left, and direct splits in the split-based RF on the right. The numbers 1 to 4 indicate the splits. For example, 3 refers to the split that separates the categories up to the third from the higher ones. It is seen that the first six variables show strong importance, with the importance being stronger for lower category splits and weaker for higher category splits. The lower panel shows the importance measures averaged across the splits. The lower curves, which are almost identical, show the averages for the adjacent categories and the split-based random forest. The panel shows, in addition, the Gini importance for the multi-category random forest obtained from randomForest. It is seen that the importance measures have the same order for all the fitted random forests. That the values of importance for the multi-category random forest are higher than for the other two forests is merely a scaling effect.

Figure 6 shows the corresponding picture if conditional trees are used (cforest). Conditional trees avoid the bias that is found if categorical variables with varying numbers of categories and a mixture of categorical and continuous predictors are used, see, for example, Strobl et al. (2007). Consequently, the obtained importance measures differ from the Gini importance measures. It is seen that variables 1, 2 and 7 are very influential. In particular, the importance of variable 1, which is a categorical variable, is more distinct than in the Gini importance measures.
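The per-split importances shown in the upper panels, and their averages in the lower panel, can be sketched as follows. This is a minimal illustration assuming scikit-learn's RandomForestClassifier as the binary building block (the paper itself uses the R packages randomForest and cforest); each split k dichotomizes the ordinal response into I(y > k), a binary forest is grown per split, and the Gini importances are averaged across splits.

```python
# Sketch of per-split Gini importances for a split-based ordinal forest,
# using scikit-learn as an illustrative binary building block (assumption).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def split_based_importances(X, y, random_state=0):
    """One binary forest per split I(y > k).
    Returns (per-split importances, importances averaged over splits)."""
    categories = np.sort(np.unique(y))
    per_split = []
    for k in categories[:-1]:                   # splits 1, ..., K-1
        rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
        rf.fit(X, (y > k).astype(int))          # dichotomize at category k
        per_split.append(rf.feature_importances_)
    per_split = np.array(per_split)
    return per_split, per_split.mean(axis=0)    # rows: splits; mean over splits
```

The rows of the first return value correspond to the split-wise curves in the upper panels, the second return value to the averaged curve in the lower panel.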
Figure 5: Gini importance for heart data; variables 1 to 10: chest pain, oldpeak, age, trestbps, chol, thalach, exang, sex, fbs, restecg (randomForest fit); left upper panel: importance for conditional splits in adjacent categories RF, right upper panel: importance for splits in split-based RF, lower panel: averaged importance measures for splits in adjacent categories and split-based RF (lower curves) and multi-categorical fit of randomForest (upper curve).
Figure 6: Importance for heart data; variables 1 to 10: chest pain, oldpeak, age, trestbps, chol, thalach, exang, sex, fbs, restecg (cforest fit); left upper panel: importance for conditional splits in adjacent categories RF, right upper panel: importance for splits in split-based RF, lower panel: averaged importance measures for splits in adjacent categories (lower curve) and split-based RF (upper curve).
The split variables, which are the building blocks of ordinal models, have been used to develop ordinal trees and random forests. The basic concept can also be used to generate alternative parametric or nonparametric classification methods that account for the order in responses. One can, for example, use two-class linear discriminant analysis, binary models with variable selection by the lasso in the case of many predictors, or nonparametric methods such as the nearest neighbor classifier for two classes. All of these methods can be used to model the split variables conditionally or unconditionally. In the present paper we restricted consideration to random forests since the objective was to construct score-free random forests.

The more recently proposed ordinal random forests are also in some way inspired by parametric ordinal models, but in a different way than the split variables approach propagated here. The score-free random forests proposed by Buri and Hothorn (2020) follow a quite different strategy. They fit a cumulative logit model and use the likelihood contributions of the observations to obtain test statistics. The core idea is to regress the obtained partial derivatives of the log-likelihood on prognostic variables. By using the cumulative model the order of categories is exploited without the need for assigned scores. It should be noted, however, that the "pure" cumulative model is fitted in subpopulations without including predictors. The ordinal forest propagated by Hornung (2020) also uses the cumulative logistic model. It exploits the latent continuous response variable underlying the observed ordinal response variable by explicitly using the widths of the adjacent intervals in the range of the continuous response variable. These intervals are considered as corresponding to the classes of the ordinal response variable.
That means, "the ordinal response variable is treated as a continuous variable, where the differing extents of the individual classes of the ordinal response variable are implicitly taken into account" (Hornung, 2020). The approach is closely related to conventional random forests for continuous outcomes but optimizes the assigned scores instead of considering them as given, and therefore is score-free in a certain sense.
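The split-variables construction discussed above reduces, on the prediction side, to combining the exceedance probabilities P(Y > k | x) returned by the binary building blocks into category probabilities. The following sketch illustrates this combination step; the function name and the simple monotonicity repair are assumptions for illustration, not the paper's implementation.

```python
# Sketch of combining binary split models into ordinal class probabilities.
# Any binary classifier that estimates P(Y > k | x) per split k can serve
# as the building block; this only shows the score-free combination step.
import numpy as np

def combine_split_probs(exceed_probs):
    """exceed_probs: array of shape (n, K-1), column k holding P(Y > k+1 | x).
    Returns an (n, K) array of category probabilities P(Y = k | x)."""
    q = np.asarray(exceed_probs, dtype=float)
    # Separately fitted binary models need not respect the monotonicity
    # q_1 >= q_2 >= ...; a simple running-minimum repair enforces it.
    q = np.minimum.accumulate(q, axis=1)
    n = q.shape[0]
    bounds = np.hstack([np.ones((n, 1)), q, np.zeros((n, 1))])
    return bounds[:, :-1] - bounds[:, 1:]   # differences of survivor probabilities
```

For instance, exceedance probabilities (0.8, 0.5, 0.2) over three splits yield the four category probabilities (0.2, 0.3, 0.3, 0.2), which sum to one by construction.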
References
Agresti, A. (2010). Analysis of Ordinal Categorical Data, 2nd Edition. New York: Wiley.
Anderson, J. A. (1984). Regression and ordered categorical variables. Journal of the Royal Statistical Society B 46, 1–30.
Anderson, J. A. and R. R. Phillips (1981). Regression, discrimination and measurement models for ordered categorical variables. Applied Statistics 30, 22–31.
Andrich, D. (2013). An expanded derivation of the threshold structure of the polytomous Rasch model that dispels any 'threshold disorder controversy'. Educational and Psychological Measurement 73(1), 78–124.
Archer, K. J. (2010). rpartOrdinal: an R package for deriving a classification tree for predicting an ordinal response. Journal of Statistical Software 34, 7.
Bender, R. and U. Grouven (1998). Using binary logistic regression models for ordinal data with non-proportional odds. Journal of Clinical Epidemiology 51, 809–816.
Brant, R. (1990). Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics 46, 1171–1178.
Breiman, L. (1996). Bagging predictors. Machine Learning 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
Bühlmann, P., B. Yu, et al. (2002). Analyzing bagging. The Annals of Statistics 30(4), 927–961.
Buri, M. and T. Hothorn (2020). Model-based random forests for ordinal regression. The International Journal of Biostatistics 1 (ahead-of-print).
Campbell, M. K. and A. P. Donner (1989). Classification efficiency of multinomial logistic-regression relative to ordinal logistic-regression. Journal of the American Statistical Association 84(406), 587–591.
Campbell, M. K., A. P. Donner, and K. M. Webster (1991). Are ordinal models useful for classification? Statistics in Medicine 10, 383–394.
Chu, W. and S. S. Keerthi (2007). Support vector ordinal regression. Neural Computation 19(3), 792–815.
Cox, C. (1995). Location-scale cumulative odds models for ordinal data: A generalized non-linear model approach. Statistics in Medicine 14, 1191–1203.
Fernandez, D., I. Liu, and R. Costilla (2019). A method for ordinal outcomes: The ordered stereotype model. International Journal of Methods in Psychiatric Research, e1801.
Galimberti, G., G. Soffritti, and M. Di Maso (2012). Classification trees for ordinal responses in R: the rpartScore package. Journal of Statistical Software 47.
Gneiting, T. and A. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–376.
Goodman, L. A. (1981a). Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association 76, 320–334.
Goodman, L. A. (1981b). Association models and the bivariate normal for contingency tables with ordered categories. Biometrika 68, 347–355.
Greenland, S. (1994). Alternative models for ordinal logistic regression. Statistics in Medicine 13, 1665–1677.
Gregorutti, B., B. Michel, and P. Saint-Pierre (2017). Correlation and variable importance in random forests. Statistics and Computing 27(3), 659–678.
Hapfelmeier, A., T. Hothorn, K. Ulm, and C. Strobl (2014). A new variable importance measure for random forests with missing data. Statistics and Computing 24(1), 21–34.
Hornung, R. (2020). Ordinal forests. Journal of Classification 37, 4–17.
Hothorn, T., K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15, 651–674.
Hothorn, T. and A. Zeileis (2015). partykit: A modular toolkit for recursive partytioning in R. The Journal of Machine Learning Research 16(1), 3905–3909.
Iannario, M., D. Piccolo, and R. Simone (2020). CUB: a class of mixture models for ordinal data. R package version 1.1.4, http://cran.r-project.org/package=cub.
Janitza, S., G. Tutz, and A.-L. Boulesteix (2016). Random forest for ordinal responses: prediction and variable selection. Computational Statistics & Data Analysis 96, 57–73.
Kateri, M. (2014). Contingency Table Analysis. Springer.
Kim, J.-H. (2003). Assessing practical significance of the proportional odds assumption. Statistics & Probability Letters 65(3), 233–239.
Liaw, A., M. Wiener, L. Breiman, and A. Cutler (2015). Package randomForest.
Liu, I., B. Mukherjee, T. Suesse, D. Sparrow, and S. K. Park (2009). Graphical diagnostics to check model misspecification for the proportional odds regression model. Statistics in Medicine 28(3), 412–429.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149–174.
Masters, G. N. and B. Wright (1984). The essential process in a family of measurement models. Psychometrika 49, 529–544.
McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society B 42, 109–127.
Muraki, E. (1997). A generalized partial credit model. Handbook of Modern Item Response Theory, 153–164.
Peterson, B. and F. E. Harrell (1990). Partial proportional odds models for ordinal response variables. Applied Statistics 39, 205–217.
Rattinger, H., S. Roßteutscher, R. Schmitt-Beck, B. Weßels, and C. Wolf (2014). Pre-election cross section (GLES 2013). GESIS Data Archive, Cologne ZA5700 Data file Version 2.0.0.
Rudolfer, S. M., P. C. Watson, and E. Lesaffre (1995). Are ordinal models useful for classification? A revised analysis. Journal of Statistical Computation and Simulation 52(2), 105–132.
Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics 9(1), 307.
Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1), 25.
Tutz, G. (2012). Regression for Categorical Data. Cambridge University Press.
Tutz, G. (2021). Ordinal regression: a review and a taxonomy of models.