Statistical Inference for Ordinal Predictors in Generalized Linear and Additive Models with Application to Bronchopulmonary Dysplasia
Jan Gertheiss∗, Fabian Scheipl, Tina Lauer, Harald Ehrhardt

February 4, 2021

School of Economics and Social Sciences, Helmut Schmidt University, Hamburg, Germany; Department of Statistics, Ludwig Maximilians University, Munich, Germany; Department of General Pediatrics and Neonatology, Justus Liebig University, Giessen, Germany; German Center for Lung Research (DZL), Universities of Giessen and Marburg Lung Center (UGMLC), Giessen, Germany

∗To whom correspondence should be addressed: [email protected]
Abstract
Discrete but ordered covariates are quite common in applied statistics, and some regularized fitting procedures have been proposed for proper handling of ordinal predictors in statistical modeling. In this study, we show how quadratic penalties on adjacent dummy coefficients of ordinal predictors proposed in the literature can be incorporated in the framework of generalized additive models, making tools for statistical inference developed there available for ordinal predictors as well. Motivated by an application from neonatal medicine, we discuss whether results obtained when constructing confidence intervals and testing significance of smooth terms in generalized additive models remain useful when combined with ordinal predictors and penalties.
Keywords: chronic lung disease, logit model, ordinal data, regularization, smoothing penalty
Bronchopulmonary dysplasia (BPD) is a chronic lung disease often found in preterm infants whose lungs are not fully developed. Disturbance of lung development and the severity of BPD are caused by various peri- and postnatal factors, including prematurity itself as well as pre- and postnatal infections [16]. BPD is measured on an ordinal scale with grades 0, 1, 2, 3, but is often dichotomized as 0: 'no/mild BPD' and 1: 'moderate/severe BPD'. One goal of the study reported here is to investigate whether the time after birth at which specific bacteria were found for the first time in the children's upper airway has an effect on BPD. The study was conducted following the rules of the Declaration of Helsinki of 1975, revised in 2013. The retrospective analysis was approved by the ethics committee of the Justus-Liebig-University Giessen (Az 97/14). Initially, n = 102 preterm infants with a birth weight < … were screened for bacterial colonization.

[Figure 1: Detection of oral bacteria and occurrence of BPD. Panels show, for gram-negative, gram-positive, and pathogenic bacteria, the frequency of no/mild vs. moderate/severe BPD by the week of first detection; the last category is '> 6 / not found'.]

Earlier analyses already showed that the later bacteria were found, the lower the risk of developing BPD [22]. However, it is not yet fully understood which specific bacteria have an effect, and in which way. Figure 1 shows the occurrence of BPD depending on the week (after birth) in which three types of bacteria (gram-negative, gram-positive, and pathogenic) were found for the first time in the upper airway of the respective child. Although 'time' is in principle a continuous variable, the information is only available in discretized form here, because samples were only obtained once a week. Furthermore, the last category is open/censored: if an observation falls into the last category, we only know that the respective germ had not been found until week six.
The covariate 'time/week' may therefore only be considered as discrete but ordinal. Besides the information on bacteria, some additional risk factors need to be taken into account, such as the weight and sex of the child, the number of days antibiotics and steroids were given, or information on multiples. For doing so, a logit model with the categorical/ordinal predictor 'week' and additional, potentially confounding covariates may be fit. Typically, a categorical covariate is included as a dummy-coded factor, ignoring, however, the ordering of the categories (if any). In the case presented, additional problems are caused by the fact that some categories contain only a few observations (see Figure 1), and sometimes all of those observations fall into the same response category. The latter means that in a logit model fitted via maximum likelihood, the corresponding regression coefficients tend towards ±∞.

For preventing numerical problems like inflating regression coefficients, penalization can be a viable solution [20]. Furthermore, penalty terms can be used for exploiting/respecting the covariates' ordinal scale level. Following Gertheiss & Tutz [13, 32, 33], for instance, a difference penalty might be put on adjacent dummy coefficients of the ordinal factor when fitting the model; an approach that has already been applied successfully in medical research, see, e.g., [2, 14]. In our application, however, the question naturally arises how to test for significance of the ordinal predictor in the penalized setting. In a linear model with normal errors, this can be done using a (restricted) likelihood ratio test [3, 4, 15, 30], after rewriting the ordinal penalty as a mixed model [12, 10] (see also Section 2).
However, the corresponding test is not available for generalized linear models, such as the logit model considered here. In this study, we will discuss how technology developed for generalized additive models [19, 37] can be used to fit generalized linear and additive models with an ordinal smoothing penalty, and to conduct further statistical inference. In more detail, the rest of the paper is organized as follows. In Section 2, we will revisit generalized additive models, the ordinal penalty, and corresponding statistical inference. In Section 3, we will carry out some numerical experiments to investigate size and power of the test(s) described in Section 2 as well as coverage of confidence intervals. Furthermore, we will compare the accuracy of penalized estimates to well-established standard procedures typically used for data analysis with ordinal covariates. The BPD data will be analyzed in more detail in Section 4, and Section 5 concludes.

Given a response y with a distribution from a simple exponential family, and a set of covariates x_1, ..., x_p, a generalized additive model [19] has the form

η = α + f_1(x_1) + ... + f_p(x_p), µ = h(η),   (1)

where µ is the (conditional) mean of y given the covariates, h is a (known) response function, and η is comparable to the linear predictor in generalized linear models [26, 25]. The difference to a generalized linear model is that non-linear functions f_j, j = 1, ..., p, are allowed in η, but the structure of η is still additive. Of course, if the f_j are restricted to be linear, a generalized linear model is obtained as a special case. In a (generalized) additive model, however, it is usually only assumed that the functions f_j are reasonably smooth; and one way to fit such models, as for instance implemented in the popular R package mgcv [29, 37], is to specify a set of basis functions for each predictor and to employ an appropriate, quadratic smoothing penalty on the corresponding basis coefficients.
More precisely, we assume that

f_j(x) = Σ_{r=1}^{q_j} β_{jr} B_{jr}(x),   (2)

with B_{j1}(x), ..., B_{jq_j}(x) being the set of basis functions chosen for function f_j, and β_{j1}, ..., β_{jq_j} the corresponding basis coefficients. So by fitting the basis coefficients, the function f_j is fitted implicitly. For instance, a popular choice in case of a continuous x is the so-called B-spline basis, leading to f_j being modeled as a spline function; see, e.g., de Boor [5], Dierckx [6] or Eilers & Marx [7] for details on (B-)splines. The big advantage of the basis functions approach is that after plugging in the observed data x_ij, i = 1, ..., n, j = 1, ..., p, the vector f_j = (f_j(x_1j), ..., f_j(x_nj))ᵀ can be written as B_j β_j, with matrix (B_j)_{ir} = B_{jr}(x_ij) and vector β_j = (β_{j1}, ..., β_{jq_j})ᵀ. So model (1) can be written as a (generalized) linear model with coefficients β_{jr}, j = 1, ..., p, r = 1, ..., q_j, and the basis coefficients can be fitted accordingly, in particular by maximum likelihood/Fisher scoring.

The problem with this approach is that typically a rather large number of basis functions needs to be chosen to be sufficiently flexible with respect to the type of functions that can be fitted. With a large basis, however, the number of basis coefficients to be fitted becomes large, too. As a consequence, resulting functions tend to be wiggly and thus hard to interpret. Therefore, a penalty term J_j(β_j) is typically added for each covariate x_j, penalizing wiggly basis coefficients and thus wiggly functions f_j. When combining the B-spline basis with a quadratic difference penalty on adjacent β_{jr}, for instance, we obtain the so-called P-spline [7, 24, 8].

Now suppose we have a categorical predictor x_j with levels 1, ..., k_j. Then there is a somewhat natural basis: the basis of (dummy) functions (l = 1, ..., k_j)

B_{jl}(x_j) = 1 if x_j = l, 0 otherwise.   (3)
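As a small illustration (plain NumPy; the function and variable names are ours, not from the paper), the dummy basis (3) and the representation f_j = B_j β_j can be sketched as follows:

```python
import numpy as np

def dummy_basis(x, k):
    """n x k dummy basis matrix B with B[i, l-1] = 1 iff x[i] == l, cf. (3)."""
    x = np.asarray(x)
    B = np.zeros((len(x), k))
    B[np.arange(len(x)), x - 1] = 1.0
    return B

# Ordinal covariate with levels 1..4, observed on n = 6 samples.
x = np.array([1, 3, 2, 4, 4, 1])
B = dummy_basis(x, k=4)

# Evaluating f(x) = sum_l beta_l B_l(x) is a matrix-vector product, f = B beta.
beta = np.array([0.5, -0.2, 0.1, 0.9])
f = B @ beta
print(f)  # each entry equals beta[x_i - 1]
```

With a dummy basis, fitting the basis coefficients amounts to fitting one (constrained) level effect per category.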
Since we know that x_j can only take the values 1, ..., k_j, we do not need to think about the type and number of basis functions, placement of knots, etc., as we usually do with continuous covariates. For identifiability, however, some linear restrictions need to be placed on the basis/dummy coefficients β_{jl}. Typically, this is done by specifying a so-called reference category, e.g., category 1 for categorical predictor x_j, and setting the corresponding β_{j1} = 0. However, one may also set Σ_l β_{jl} = 0. In case of a continuous x_j, a popular constraint is

Σ_{i=1}^n f_j(x_ij) = 0 for all j,   (4)

which translates into Σ_l n_{jl} β_{jl} = 0 for categorical predictor x_j, with n_{jl} being the number of samples with level l observed for x_j. Since the latter is the typical constraint in generalized additive models, also used in mgcv, we will use (4) here as well (if not stated otherwise). After having fit the functions/basis coefficients, however, one can switch between constraints easily, because switching between the constraints mentioned above for some f_j yields an equivalent model: it simply means a vertical shift of the entire function f_j and a corresponding change in the constant α in (1).

If a (quadratic) difference penalty is now put on the basis/dummy coefficients, this gives exactly the smoothing penalty mentioned above [13, 32, 33]. More precisely, with β_{jl}, l = 1, ..., k_j, denoting the dummy coefficient of level l, the penalty primarily used is the first-order penalty

J_j(β_j) = Σ_{l=2}^{k_j} (β_{jl} − β_{j,l−1})²,   (5)

with β_j = (β_{j1}, ..., β_{jk_j})ᵀ. Alternatively, the second-order penalty

J_j(β_j) = Σ_{l=2}^{k_j−1} (β_{j,l+1} − 2β_{jl} + β_{j,l−1})²,   (6)

penalizing deviations from linearity, can be used as well [12]. The strength of the penalty put on β_j is determined by a parameter λ_j, which may differ across predictors x_j.
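Both penalties can equivalently be written as quadratic forms β_jᵀ Dᵀ D β_j with a first- or second-order difference matrix D; a minimal NumPy sketch with illustrative values and k_j = 5 levels:

```python
import numpy as np

def diff_matrix(k, order=1):
    """(k - order) x k matrix applying first differences `order` times."""
    D = np.eye(k)
    for _ in range(order):
        D = D[1:] - D[:-1]
    return D

beta = np.array([0.0, 0.3, 0.4, 1.2, 1.1])  # dummy coefficients, k = 5 levels

# First-order penalty (5) as a quadratic form.
D1 = diff_matrix(5, order=1)
J1 = float(beta @ D1.T @ D1 @ beta)
assert np.isclose(J1, np.sum(np.diff(beta) ** 2))

# Second-order penalty (6), penalizing deviations from linearity.
D2 = diff_matrix(5, order=2)
J2 = float(beta @ D2.T @ D2 @ beta)
assert np.isclose(J2, np.sum(np.diff(beta, n=2) ** 2))

# Coefficients lying exactly on a straight line are not penalized by (6).
print(np.allclose(D2 @ np.arange(5.0), 0.0))  # True
```

In the mixed-model reparameterization discussed below, matrices of this difference type define the random effects u_j = D β_j.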
Parameters can then be fitted by maximizing the penalized log-likelihood

l_p(β) = l(β) − Σ_{j=1}^p λ_j J_j(β_j),

where β denotes the vector collecting all basis coefficients and, if applicable, additional regression coefficients of covariates included parametrically, i.e., in terms of linear effects; l(β) is the usual, unpenalized log-likelihood. In case of λ_j = 0 for all j, the usual maximum likelihood estimates are obtained. If λ_j → ∞ in case of penalty (5), the fitted f_j(x) = 0 for all x ∈ {1, ..., k_j} is obtained because of the constraint (no matter which one is chosen from the options given above). If penalty (6) is chosen, a large λ_j leads to a function f_j that is linear in the class labels 1, ..., k_j.

One of the benefits of considering ordinal predictors along with penalties (5) and (6) in the framework of generalized additive models is that after implementing basis (3) in the appropriate way, gam() from mgcv can be used directly to fit a generalized linear/additive model with ordinal predictor(s) as needed for the BPD data. Besides pure model fitting, this provides us with additional tools; in particular, built-in estimation of the penalty/smoothing parameter(s) and further statistical inference, such as confidence intervals and checking the significance of smooth terms. A prerequisite for at least some of those tools, however, is to rewrite our model with quadratic smoothing penalty as a (generalized) linear mixed model.

Starting with penalty (5), we may rewrite our model in terms of β̃_j = D̃ β_j with

D̃ = [ 1  0 ... 0 ]
    [      D     ],

where D is the (k_j − 1) × k_j first-order difference matrix

D = [ −1  1  0 ...  0 ]
    [  0 −1  1 ...  0 ]
    [        ...      ]
    [  0 ...  0 −1  1 ].

We denote the last k_j − 1 elements of β̃_j by u_j = (u_{j1}, ..., u_{j,k_j−1})ᵀ, such that u_{jl} = β_{j,l+1} − β_{jl} and u_j = D β_j, with D from above. The entries of u_j are now interpreted as iid random effects with u_{jl} ∼ N(0, τ_j²) [12, 10, 31]. Then, for given variance parameters τ_j², j = 1, . . .
, p (note, the random effects' variance may vary between covariates), maximizing the log-likelihood over β̃_j yields estimates that are equivalent to the smoothed dummies obtained via penalty (5), with a one-to-one correspondence between the penalty parameter λ_j and the variance parameter τ_j². Alternatively, smoothed dummy coefficients can be derived in a Bayesian framework (as the mode of the posterior density) by putting a Gaussian random walk prior (with prior variance τ_j²) on the dummy coefficients [13]. Analogously, penalty (6) can be derived/interpreted in a mixed model/Bayesian framework. For that purpose, we replace D̃ above by

D̃₂ = [ 1 0 0 ... 0 ]
     [ 0 1 0 ... 0 ]
     [     D₂      ],

where D₂ is the (k_j − 2) × k_j second-order difference matrix

D₂ = [ 1 −2  1  0 ...  0 ]
     [ 0  1 −2  1 ...  0 ]
     [         ...       ]
     [ 0 ...  0  1 −2  1 ].

Then, the last k_j − 2 elements of β̃_j are denoted by u_j = (u_{j1}, ..., u_{j,k_j−2})ᵀ, such that u_{jl} = β_{j,l+2} − 2β_{j,l+1} + β_{jl}, and as before u_{jl} ∼ N(0, τ_j²) are interpreted as iid random effects [12, 37].

Both approaches, the mixed model and the Bayesian interpretation, can be used for determining the variance components τ_j², and thereby the penalty parameters λ_j. In theory, we may integrate out the random effects from the joint density of the response and random effects, giving the marginal likelihood of the fixed effects and the variance parameters. Maximizing this likelihood leads to ML estimates of the fixed effects and variance parameters. In generalized linear mixed models, however, calculating the integral analytically is often not possible, and it is also numerically demanding. The standard approach is the so-called Laplace approximation [1], which essentially cycles through the penalized log-likelihood given above, plugging in the corresponding estimates of the regression/basis coefficients to obtain an approximate profile likelihood for the variance parameters that can be maximized. In the Bayesian framework, the smoothed dummies are then interpreted as an empirical Bayes estimator, since the τ_j² are estimated from the data. In a fully Bayesian approach, we could choose a hyper-prior, e.g., an inverse Gamma, for τ_j² and apply Markov chain Monte Carlo (MCMC); but we will focus on the mixed model/empirical Bayes approach here.

A problem with ML estimation of variance parameters is that those estimates are typically biased downwards, that is, the true variance tends to be underestimated, in particular if the number of fixed effects is large. As an alternative to ML which reduces this bias, so-called restricted maximum likelihood (REML) estimation has been proposed, which can be motivated in different ways [28, 17, 18, 21, 9]. Eventually, in the linear mixed model, the (profile) log-likelihood is (additively) corrected such that the number/structure of fixed effects is taken into account. In generalized mixed models, this can be done analogously within the Laplace approximation. It should be noted that REML cannot be used to compare models with different fixed effects structures. Nevertheless, REML is very popular for estimating variance components in mixed models, due to the reduced bias, and hence for determining smoothing/penalty parameters in (generalized) additive models as well [35]. Besides likelihood-based methods, prediction error methods such as (generalized) cross-validation [34] or the AIC [38] can also be used for smoothness selection.

In particular, the Bayesian interpretation of quadratic smoothing penalties is useful to derive confidence intervals. In the Gaussian identity link/linear mixed model case, we can derive the covariance matrix of the regression/basis coefficients' posterior distribution. In the generalized case, the corresponding matrix from the penalized iteratively re-weighted least squares (PIRLS) algorithm used to estimate the parameters is taken. Using this matrix, say V_β, we can calculate (point-wise) credible intervals for the function f̂_j = B_j β̂_j, denoting the vector of f̂_j(x) values at the evaluation points x_ij.
For each element f̂_ji of f̂_j, we obtain an approximate (1 − α)·100% credible interval via f̂_ji ± z_{1−α/2} √v_ji, where v_ji is the i-th diagonal element of B_j V_β B_jᵀ, and z_q is the q-quantile of the standard normal distribution. It turned out that those credible regions also have good frequentist coverage rates in generalized additive models with continuous covariates [27, 23]. In Section 3, we will investigate whether this is also the case with smoothed ordinal effects.

Closely related to the confidence intervals described above is testing the significance of smooth terms. Specifically, we would like to test the null hypothesis H_0: f_j(x) = 0 for all potential values x of predictor x_j; with ordinal predictors, this means for all levels of the predictor of interest. Analogously to the confidence intervals above, let f̂_j = B_j β̂_j denote the vector of f̂_j(x) evaluated at the predictor levels (i.e., with appropriately chosen dummy matrix B_j). Then, following [36, 37], we can define a Wald-type test statistic

T = f̂_jᵀ V_j⁻ f̂_j,

where V_j⁻ is an appropriately chosen pseudo-inverse of V_j = B_j V_β B_jᵀ. Under H_0, T (approximately) follows a mixture of χ²-distributions. For the concrete form of this mixture and of V_j⁻, see Wood [36, 37]. When taking the mixed models perspective on penalty (5), the null hypothesis above can alternatively be written as H_0: τ_j² = 0. In the Gaussian/linear mixed model with only one smooth term/variance component, a (restricted) likelihood ratio test can be used [3, 4, 30], which also works well with ordinal data [10, 12, 31], and extensions/approximations exist in case of multiple ordinal predictors/smooth terms [10, 15, 30]. To the best of our knowledge, however, no generalization exists beyond models with Gaussian response.
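A minimal sketch of both constructions (NumPy/SciPy; V_β is a made-up stand-in for a fitted posterior covariance, and we use a plain χ² with the rank of V_j as degrees of freedom instead of the χ² mixture used by mgcv):

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(1)

k = 4                                    # levels of the ordinal predictor
B = np.eye(k)                            # dummy basis evaluated at each level
f_hat = np.array([0.2, -0.1, 0.4, 0.6])  # illustrative fitted level effects

# Stand-in for the posterior covariance V_beta of the basis coefficients.
A = rng.normal(size=(k, k))
V_beta = A @ A.T + 0.1 * np.eye(k)

V = B @ V_beta @ B.T
v = np.diag(V)

# Point-wise (1 - alpha)*100% credible intervals: f_hat +/- z_{1-alpha/2} sqrt(v).
alpha = 0.05
z = norm.ppf(1 - alpha / 2)
lower, upper = f_hat - z * np.sqrt(v), f_hat + z * np.sqrt(v)

# Wald-type statistic T = f_hat' V^- f_hat, with a pseudo-inverse of V.
T = float(f_hat @ np.linalg.pinv(V) @ f_hat)
p_value = float(chi2.sf(T, df=np.linalg.matrix_rank(V)))
print(T, p_value)
```

In practice, V_β, the effective degrees of freedom, and the χ² mixture all come from the fitted penalized model; the sketch only shows the arithmetic of the interval and the test statistic.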
Therefore, we will focus on the Wald-type test here. For instance, when looking at summary.gam() with ordinal predictor x denoting the (first) week of overall oral bacteria detection, and correcting for various (known) risk factors of BPD and/or potential confounders, we have:

    Family: binomial
    Link function: logit

    Formula:
    y ~ s(weight, bs = "cr") + SGAsym + sex + mult + s(steroid, bs = "cr") +
        s(anti, bs = "cr") + s(x, bs = "ordinal", m = 2)

    Parametric coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -2.8377     0.8537  -3.324 0.000888 ***
    SGAsym        2.0615     1.2461   1.654 0.098061 .
    sex           1.8593     0.7143   2.603 0.009242 **
    mult          0.5417     0.4102   1.321 0.186584
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Approximate significance of smooth terms:
               edf Ref.df Chi.sq  p-value
    s(weight)    1      1 13.151 0.000287 ***
    s(steroid)   1      1  4.527 0.033364 *
    s(anti)      1      1  4.675 0.030600 *
    s(x)         1      1  5.627 0.017682 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Figure 2: QQ-plots of observed p-values against uniform quantiles for the first-order (m = 1) and second-order (m = 2) penalty (red/blue), for gram-negative, gram-positive, and pathogenic bacteria.]
    R-sq.(adj) = 0.438   Deviance explained = 42%
    -REML = 30.444   Scale est. = 1   n = 101
Here,
SGAsym is an indicator for 'Small for Gestational Age', defined as birth weight, length, and head circumference below the 10th percentile according to the percentiles from the German perinatal registry; sex is an indicator for male; and mult denotes multiples. The child's weight at birth is denoted by weight, steroid is the number of days (before delivery) steroids were given (to the mother), and anti the number of days (after birth) antibiotics were given (to the child). We see that all the smooth effects are eventually fitted as linear (as the effective degrees of freedom/edf equal 1), and are indicated as significant at the 5% level (at least), which confirms results obtained earlier for this dataset [22]. In fact, the p-values given above are (virtually) identical to the p-values when fitting a standard logit model with linear effects (e.g., using the R function glm(), not shown here), indicating the negative effect of x, i.e., the risk of BPD is increased if overall bacterial colonization is detected early. In general, however, it is not yet clear whether the p-values provided in the generalized additive model are reliable for the ordinal smoothing penalty. So this needs to be investigated before using the ordinal smoothing approach for analyzing the effects of specific types of bacteria.

For investigating whether the p-values provided by summary.gam() are reliable when using the ordinal basis/smoothing penalty, we will discuss the results of some simulation studies, starting with the BPD data.
We use the confounder model, i.e., the model with the information on bacteria removed, to estimate BPD probabilities. That means the null hypothesis that the effect of the (ordinal predictor) x is zero is true by construction, because the fitted BPD probabilities do not depend on x, given the other covariates. Using those probabilities, we simulate 'new' BPD response data, fit the model with the smooth ordinal x added (and the smoothing parameter estimated by REML), and record the p-values of x. For x, we use gram-negative/positive and pathogenic bacteria, respectively (compare Figure 1).

[Figure 3: Size and power for various tests and settings at two significance levels α (solid/dashed lines); the top row refers to the near-straight-line, the bottom row to the non-monotone scenario. Left panels show the true functions f(x); center and right panels show rejection rates for the Gaussian/identity and binary/logit case, respectively, for ordinal, linear, and factor modeling.]

For each x, this is done 1,000 times; Figure 2 shows corresponding QQ-plots of the p-values observed when employing the first- and second-order penalty, respectively. Since the distribution of small p-values is particularly relevant when testing with the usual α ≤ 0.1, we restrict plotting to that area. It is seen that the p-values obtained when employing the first-order penalty are typically too small. The problems with the first-order penalty can be explained by the fact that the null space of the corresponding smooth term has dimension zero (compare the mgcv manual). In other words, the null hypothesis in the framework of mixed models (which is used for estimation here), a zero variance component, is on the boundary of the parameter space, which means that standard theory does not apply [3, 4].
The results for the second-order penalty, by contrast, look very encouraging. As a consequence of these findings on the BPD data, we will only consider the second-order penalty when further investigating size and power below.
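The logic of such a null simulation can be mimicked in a simplified Gaussian analogue (illustrative only; the paper's actual study uses the fitted logit confounder model and REML-based smooth tests): if a test is calibrated, p-values for an effect that is zero by construction should be approximately uniform.

```python
import numpy as np
from scipy.stats import f as f_dist, kstest

rng = np.random.default_rng(7)
n, k, n_sim = 100, 6, 500

def factor_test_pvalue(y, X0, X1):
    """F-test of the dummy columns added in X1 over the null design X0."""
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    df1 = X1.shape[1] - X0.shape[1]
    df2 = len(y) - X1.shape[1]
    F = ((rss(X0) - rss(X1)) / df1) / (rss(X1) / df2)
    return f_dist.sf(F, df1, df2)

pvals = []
for _ in range(n_sim):
    z = rng.uniform(size=n)
    x = rng.integers(1, k + 1, size=n)        # ordinal predictor, no true effect
    y = 3 * np.sqrt(z) + rng.normal(size=n)   # null model: y depends on z only
    X0 = np.column_stack([np.ones(n), np.sqrt(z)])
    dummies = (x[:, None] == np.arange(2, k + 1)).astype(float)
    X1 = np.column_stack([X0, dummies])
    pvals.append(factor_test_pvalue(y, X0, X1))

# Under the null, a calibrated test yields (approximately) uniform p-values.
print(round(float(kstest(pvals, "uniform").pvalue), 3))
```

The BPD study applies the same idea with penalized smooth terms, where calibration is exactly what is in question near the boundary of the parameter space.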
Figure 4: Coverage observed for (nominally) 95% confidence intervals when using the first-order (yellow) and second-order (blue) penalty in different simulation settings.
Given a continuous covariate z ∼ U(0, 1) and an ordinal predictor x uniformly sampled from {1, ..., 6}, we assume

• a Gaussian distribution for y with µ = 3√z + ψ f(x) and standard normal errors,
• a logit model with linear predictor η = −√z + ψ f(x).

In both cases, the parameter ψ ≥ 0 determines the effect size of x. For f, we consider the two functions given in Figure 3 (left). Furthermore, size and power over 10,000 replicates with sample size 100 each are shown for the Gaussian/identity (center) and binary/logit case (right), at the two significance levels α (solid and dashed lines), and for three different tests: besides the smooth effect (black), we consider including x as a parametric/linear term (green) or as a nominal factor (blue). It is seen that the ordinal smoothing penalty adapts well to the situations considered. On the one hand, if the true, underlying function is close to linear (center), the corresponding test behaves almost identically to the parametric/linear specification. On the other hand, in the very non-linear/non-monotone case, its power is similar to unpenalized nominal/factor modeling. Although the latter seems to be better in the non-monotone, binary/logit case (bottom right), it also appears that the analysis of deviance/GLRT used here produces a rejection rate for effect size zero that is somewhat too high in the logit model (i.e., p-values are too small). In summary, our studies indicate that the ordinal penalty works well in generalized additive models (with smoothing parameters estimated by REML), and testing as provided by summary.gam() in mgcv is safe for the second-order penalty, with power highly competitive to both linear and ordinary factor modeling.

We consider the same settings as in Section 3.2, with fixed values of ψ in the Gaussian and logit case, respectively.
For comparing the estimated to the true functions/coefficients (see Figure 3, left), the estimated functions/coefficients are centered such that Σ_l β̂_l = Σ_l f̂(l) = 0 (which is also the case for the true f's in Figure 3). Figure 4 shows the coverage observed over 1,000 replicates for (point-wise) confidence intervals with a nominal level of 95%. Here we consider both the first- and the second-order penalty. Although the tests from Section 3.1 are not reliable for the first-order penalty, this penalty may still be used for model fitting, and confidence intervals provide important information on the estimates' variability. Indeed, as seen from Figure 4, the intervals work reasonably well in both the normal and the logit case with non-monotone coefficients. If coefficients are close to being linear, coverage (on average) is also close to 95% when using the first-order penalty. The second-order penalty, however, leads to substantial under-coverage in the 'near straight line' case. This problem is also found for (generalized) additive models with continuous covariates if deviations from linearity are penalized and the fitted regression function is close to being linear. The suggested fix is to change the target of inference to the smooth term plus the overall model intercept [23, 37].

[Figure 5: Squared errors in the Gaussian scenarios (near straight line / non-monotone) for the first- and second-order penalty (i.e., m = 1 and m = 2, respectively), linear and ordinary factor modeling, by predictor level 1–6; a couple of more extreme outliers are not shown, in particular for ordinary factor modeling.]

Figures 5 and 6 show graphical summaries of the squared errors when using the first- and second-order penalty (i.e., m = 1 and m = 2, respectively), linear and ordinary factor modeling. We see that both m = 1 and m = 2 produce highly competitive results, with the second-order penalty being the clear overall winner. If the true coefficients lie nearly on a straight line (left panel), assuming linearity over the factor levels also produces good results, which is of course not surprising. In case of the very non-monotone coefficients (right panel), however, the assumption of linearity is highly misleading. Ordinary factor modeling generally suffers from very large variance compared to the penalized estimates, leading to a much larger mean squared error.
This is particularly problematic with binary response data (logit model), because sometimes there are only 1s or 0s at some predictor levels, leading to an inflation of the corresponding coefficients; that is, coefficients tend towards ±∞. Those very extreme outliers are not even shown in Figure 6. Of course, those problems may vanish when increasing the sample size
and/or changing the coefficients used to generate the data. With the BPD data, however, we only have a moderate sample size (n ≈ ) and even categories containing only BPD cases or only controls (see Figure 1). So we focus on comparable settings here.

Figure 6: Squared errors in the logit scenarios for the first- and second-order penalty (i.e., m = 1 and m = 2, respectively), linear and ordinary factor modeling; a couple of more extreme outliers are not shown, in particular for ordinary factor modeling.

In addition to the model with predictor 'overall oral bacterial colonization' as given on page 6, which resulted in a generalized but otherwise purely linear model (due to the 'smooth' terms being fitted as linear), we also fitted a more detailed model with bacterial colonization categorized into 'gram negative', 'gram positive', and 'pathogenic'. As before, the corresponding ordinal predictor gives the week in which colonization by the respective type of (oral) bacteria was detected for the first time. In the original publication [22], separate models were fit for each type of bacteria, and the first two weeks were collapsed to make (unpenalized) model fitting with dummy-coded ordinal factors feasible. Thanks to the penalties presented here, we are now able to include all three predictors jointly while using all information at the resolution available.

Table 1 (left) shows the results for the parametric terms. In particular, it is confirmed that low birth weight is a risk factor for BPD, and that male infants and multiples also have an increased risk of developing BPD. Antenatal steroids, by contrast, may decrease the risk. Results for the different types of bacterial colonization, which are included as ordinal predictors with smooth effects, are given in Table 2 and Figure 7. We see that the only significant effect is detected for pathogenic bacteria. The fitted function indicates that early detection is associated with an increased risk of BPD.
Statistical uncertainty, however, is very large (due to the small number of samples in category/level 1), as indicated by the confidence interval.

Table 1: Results for the parametric terms in the full and reduced model.

                 Full model                     Reduced model
Covariate       Est.   S.E.  z-val.  p-val.    Est.   S.E.  z-val.  p-val.
(Intercept)    6.214  2.630   2.363   0.018   6.426  2.170   2.961   0.003
[remaining rows, including Weight (g), not recoverable from the extracted text]

We showed how ordinal predictors in generalized linear models, fitted using quadratic difference penalties as proposed in the literature, can be incorporated and interpreted in the framework of generalized additive models, making tools for statistical inference developed there available for ordinal predictors as well. Here we focused on fitting itself with built-in selection of penalty parameters, and on further statistical inference in terms of (pointwise) confidence intervals and testing significance of smooth terms, i.e., significance of the ordinal predictors. In particular, we confirmed that at least some of the tests provided in the popular mgcv
R package can be used with ordinal predictors, too. More specifically, this is the case for the Wald-type tests when employing the second-order difference penalty on adjacent dummy coefficients. It should be pointed out, however, that this holds in particular if smoothing parameters are estimated by REML (or ML), which is what has been investigated in this study. When using GCV (which is the default in mgcv), this is not necessarily the case.

Besides testing in a pre-defined model as done here, those tests may also be used for stepwise (backward/forward) variable selection, which is still quite popular among applied researchers, including but not
limited to medicine. However, the methods discussed here are primarily useful if the number of (ordinal) covariates is moderate. In high-dimensional problems, that is, if the number of covariates becomes large, techniques other than statistical testing may be used for model selection, for instance, sparsity-inducing penalties such as the group lasso [39, 11].

Table 2: Results for smooth terms in the full and reduced model when using the second-order ordinal smoothing penalty.

                    Full model                       Reduced model
Predictor        edf  Ref.df  Chi.sq  p-val.     edf  Ref.df  Chi.sq  p-val.
Gram negative  1.000   1.000   1.307   0.253       –       –       –       –
Gram positive  3.708   4.393   5.227   0.264       –       –       –       –
Pathogenic     5.258   5.831  13.711   0.030   4.973   5.696  13.573   0.027

Figure 7: Fitted coefficients of the ordinal predictors together with pointwise 95% confidence intervals in the full model with second-order penalties.

Add-on functions implementing the ordinal basis for use within mgcv are available on request and will be made publicly available through the R package ordPens (which is currently under revision). After sourcing the respective functions (or installing and loading ordPens, respectively), the gam() function from mgcv can be used with smooth terms s(..., bs = "ordinal").

In summary, using the ordinal smoothing penalty within generalized additive models offers a very convenient and flexible way to respect and exploit the information provided by ordinally scaled predictors, in a sound statistical framework with well-established tools for statistical inference. Built-in selection of tuning parameters yields estimates that adapt very well to the data structure, with accuracy at least competitive with, if not better than, standard linear/factor modeling.
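To sketch how such a penalty enters estimation for a single ordinal predictor with one-hot dummy coding Z, the penalized least-squares estimate takes the ridge-type form β̂ = (Z'Z + λ D_m' D_m)⁻¹ Z'y, where D_m is the m-th order difference matrix. The following pure-Python illustration uses hypothetical toy data; it is not the BPD data and not the authors' mgcv/ordPens implementation:

```python
# Illustrative sketch: penalized least-squares fit of dummy coefficients
# for one ordinal predictor, using the quadratic difference penalty
# lambda * beta' D' D beta. Pure Python, no external packages.

def diff_matrix(k, m):
    """m-th order difference matrix D (shape (k - m) x k)."""
    D = [[float(j == i + 1) - float(j == i) for j in range(k)]
         for i in range(k - 1)]
    for _ in range(m - 1):
        D = [[D[i + 1][j] - D[i][j] for j in range(k)]
             for i in range(len(D) - 1)]
    return D

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j]
                              for j in range(i + 1, n))) / M[i][i]
    return x

def penalized_fit(y_by_level, lam, m=2):
    """beta_hat = (Z'Z + lam * D'D)^{-1} Z'y for one-hot dummy coding."""
    k = len(y_by_level)
    D = diff_matrix(k, m)
    # Z'Z is diagonal with the level counts; Z'y holds the level sums.
    A = [[lam * sum(D[r][i] * D[r][j] for r in range(len(D)))
          for j in range(k)] for i in range(k)]
    for i in range(k):
        A[i][i] += len(y_by_level[i])
    b = [sum(ys) for ys in y_by_level]
    return solve(A, b)

# Observations grouped by ordinal level (hypothetical toy data):
y = [[0.9, 1.1], [2.8, 3.2], [1.9, 2.1], [3.9, 4.1]]
print(penalized_fit(y, lam=0.0))  # level means (unpenalized fit)
print(penalized_fit(y, lam=1e8))  # nearly a straight line over levels
```

With λ = 0 the estimate reduces to the per-level means, i.e., ordinary factor modeling; with large λ and m = 2 the second differences of the fitted coefficients are shrunk towards zero, so the fit approaches a straight line over the levels, mirroring the behavior discussed above.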
Author contributions
Jan Gertheiss conducted the data analysis/numerical experiments and wrote the manuscript. Fabian Scheipl implemented the method for use within mgcv. Tina Lauer and Harald Ehrhardt collected the data and were involved in the discussion of the manuscript.
Funding
This work was supported in part by Deutsche Forschungsgemeinschaft (DFG) under Grant GE2353/2-1.
References

[1] N E Breslow and D G Clayton. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88:9–25, 1993.
[2] A Cieza, C Oberhauser, J Bickenbach, S Chatterji, and G Stucki. Towards a minimal generic set of domains of functioning and health. BMC Public Health, 14:218, 2014.
[3] C M Crainiceanu and D Ruppert. Likelihood ratio tests in linear mixed models with one variance component. Journal of the Royal Statistical Society B, 66:165–185, 2004.
[4] C M Crainiceanu, D Ruppert, G Claeskens, and M P Wand. Exact likelihood ratio tests for penalized splines. Biometrika, 92:91–103, 2005.
[5] C de Boor. A Practical Guide to Splines. Springer, New York, 1978.
[6] P Dierckx. Curve and Surface Fitting with Splines. Clarendon Press, Oxford, 1993.
[7] P H C Eilers and B D Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11:89–121, 1996.
[8] P H C Eilers and B D Marx. Generalized linear additive smooth structures. Journal of Computational and Graphical Statistics, 11:758–783, 2002.
[9] L Fahrmeir, T Kneib, S Lang, and B Marx. Regression – Models, Methods and Applications. Springer, Berlin/Heidelberg, 2013.
[10] J Gertheiss. ANOVA for factors with ordered levels. Journal of Agricultural, Biological, and Environmental Statistics, 19:258–277, 2014.
[11] J Gertheiss, S Hogger, C Oberhauser, and G Tutz. Selection of ordinally scaled independent variables with applications to international classification of functioning core sets. Journal of the Royal Statistical Society C, 60:377–395, 2011.
[12] J Gertheiss and F Oehrlein. Testing relevance and linearity of ordinal predictors. Electronic Journal of Statistics, 5:1935–1959, 2011.
[13] J Gertheiss and G Tutz. Penalized regression with ordinal predictors. International Statistical Review, 77:345–365, 2009.
[14] S M Glass and S E Ross. Modified functional movement screening as a predictor of tactical performance potential in recreationally active adults. The International Journal of Sports Physical Therapy, 10:612–621, 2015.
[15] S Greven, C M Crainiceanu, H Küchenhoff, and A Peters. Restricted likelihood ratio testing for zero variance components in linear mixed models. Journal of Computational and Graphical Statistics, 17:870–891, 2008.
[16] J Gronbach, T Shahzad, S Radajewski, C-M Chao, S Bellusci, R E Morty, T Reicherzer, and H Ehrhardt. The potentials and caveats of mesenchymal stromal cell-based therapies in the preterm infant. Stem Cells International, Article ID 9652897, 2018.
[17] D A Harville. Bayesian inference for variance components using only error contrasts. Biometrika, 61:383–385, 1974.
[18] D A Harville. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72:320–338, 1977.
[19] T Hastie and R Tibshirani. Generalized Additive Models. Chapman & Hall, London, 1990.
[20] A E Hoerl and R W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
[21] N M Laird and J H Ware. Random-effects models for longitudinal data. Biometrics, 38:963–974, 1982.
[22] T Lauer, J Behnke, F Oehmke, J Bäcker, K Gentil, T Chakraborty, M Schloter, J Gertheiss, and H Ehrhardt. Bacterial colonization within the first six weeks of life and pulmonary outcome in preterm infants <1000 g. Journal of Clinical Medicine, 9:2240, 2020.
[23] G Marra and S N Wood. Coverage properties of confidence intervals for generalized additive model components. Scandinavian Journal of Statistics, 39:53–74, 2012.
[24] B D Marx and P H C Eilers. Direct generalized additive modelling with penalized likelihood. Computational Statistics & Data Analysis, 28:193–209, 1998.
[25] P McCullagh and J A Nelder. Generalized Linear Models. Chapman & Hall, New York, 2nd edition, 1989.
[26] J A Nelder and R W M Wedderburn. Generalized linear models. Journal of the Royal Statistical Society A, 135:370–384, 1972.
[27] D Nychka. Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83:1134–1143, 1988.
[28] H D Patterson and R Thompson. Recovery of inter-block information when block sizes are unequal. Biometrika, 58:545–554, 1971.
[29] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.
[30] F Scheipl, S Greven, and H Küchenhoff. Size and power of tests for a zero random effect variance or polynomial regression in additive and linear mixed models. Computational Statistics & Data Analysis, 52:3283–3299, 2008.
[31] E Sweeney, C Crainiceanu, and J Gertheiss. Testing differentially expressed genes in dose-response studies and with ordinal phenotypes. Statistical Applications in Genetics and Molecular Biology, 15:213–235, 2016.
[32] G Tutz and J Gertheiss. Rating scales as predictors – the old question of scale level and some answers. Psychometrika, 79:357–376, 2014.
[33] G Tutz and J Gertheiss. Regularized regression for categorical data (with discussion). Statistical Modelling, 16:161–200, 2016.
[34] S N Wood. Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society B, 70:495–518, 2008.
[35] S N Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society B, 73:3–36, 2011.
[36] S N Wood. On p-values for smooth components of an extended generalized additive model. Biometrika, 100:221–228, 2013.
[37] S N Wood. Generalized Additive Models: An Introduction with R. CRC Press, Boca Raton, 2nd edition, 2017.
[38] S N Wood, N Pya, and B Saefken. Smoothing parameter and model selection for general smooth models (with discussion). Journal of the American Statistical Association, 111:1548–1575, 2016.
[39] M Yuan and Y Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68:49–67, 2006.