Statistical Inference for Ordinal Predictors in Generalized Linear and Additive Models with Application to Bronchopulmonary Dysplasia
Jan Gertheiss∗, Fabian Scheipl, Tina Lauer, Harald Ehrhardt

February 4, 2021

School of Economics and Social Sciences, Helmut Schmidt University, Hamburg, Germany; Department of Statistics, Ludwig Maximilians University, Munich, Germany; Department of General Pediatrics and Neonatology, Justus Liebig University, Giessen, Germany; German Center for Lung Research (DZL), Universities of Giessen and Marburg Lung Center (UGMLC), Giessen, Germany

∗To whom correspondence should be addressed: [email protected]
Abstract
Discrete but ordered covariates are quite common in applied statistics, and some regularized fitting procedures have been proposed for proper handling of ordinal predictors in statistical modeling. In this study, we show how quadratic penalties on adjacent dummy coefficients of ordinal predictors proposed in the literature can be incorporated in the framework of generalized additive models, making tools for statistical inference developed there available for ordinal predictors as well. Motivated by an application from neonatal medicine, we discuss whether results obtained when constructing confidence intervals and testing significance of smooth terms in generalized additive models remain useful when combined with ordinal predictors and penalties.
Keywords: chronic lung disease, logit model, ordinal data, regularization, smoothing penalty
Bronchopulmonary dysplasia (BPD) is a chronic lung disease often found in preterm infants whose lungs are not fully developed. Disturbance of lung development and the severity of BPD are caused by various peri- and postnatal factors, including prematurity itself as well as pre- and postnatal infections [16]. BPD is measured on an ordinal scale with grades 0, 1, 2, 3, but is often dichotomized as 0: 'no/mild BPD' and 1: 'moderate/severe BPD'. One goal of the study reported here is to investigate whether the time after birth at which specific bacteria were found for the first time in the children's upper airway has an effect on BPD. The study was conducted following the rules of the Declaration of Helsinki of 1975, revised in 2013. The retrospective analysis was approved by the ethics committee of the Justus-Liebig-University Giessen (Az 97/14). Initially, n = 102 preterm infants with a birth weight < … were screened for bacterial colonization.

[Figure 1: Detection of oral bacteria and occurrence of BPD. Panels show, for gram-negative, gram-positive, and pathogenic bacteria, the frequency of no/mild vs. moderate/severe BPD by the week of first detection; the last category is '> 6 / not found'.]

Earlier analyses already showed that the later bacteria were found, the lower the risk of developing BPD [22]. However, it is not yet fully understood which specific bacteria have an effect, and in which way. Figure 1 shows the occurrence of BPD depending on the week (after birth) in which three types of bacteria (gram-negative, gram-positive, and pathogenic) were found for the first time in the upper airway of the respective child. Although 'time' is in principle a continuous variable, the information is only available in discretized form here, because samples were only obtained once a week. Furthermore, the last category is open/censored: if an observation falls into the last category, we only know that the respective germ had not been found until week six.
The covariate 'time/week' may therefore only be considered as discrete but ordinal. Besides the information on bacteria, some additional risk factors need to be taken into account, such as the weight and sex of the child, the number of days antibiotics and steroids were given, or information on multiples. For doing so, a logit model with the categorical/ordinal predictor 'week' and additional, potentially confounding covariates may be fit. Typically, a categorical covariate is included as a dummy-coded factor, ignoring, however, the ordering of the categories (if any). In the case presented, additional problems are caused by the fact that some categories contain only a few observations (see Figure 1), and sometimes all of those observations fall into the same response category. The latter means that in a logit model fitted via maximum likelihood, the corresponding regression coefficients tend towards ±∞.

For preventing numerical problems like inflating regression coefficients, penalization can be a viable solution [20]. Furthermore, penalty terms can be used for exploiting/respecting the covariates' ordinal scale level. Following Gertheiss & Tutz [13, 32, 33], for instance, a difference penalty might be put on adjacent dummy coefficients of the ordinal factor when fitting the model; an approach that has already been applied successfully in medical research, see, e.g., [2, 14]. In our application, however, the question naturally arises how to test for significance of the ordinal predictor in the penalized setting. In a linear model with normal errors, this can be done using a (restricted) likelihood ratio test [3, 4, 15, 30], after rewriting the ordinal penalty as a mixed model [12, 10] (see also Section 2).
However, the corresponding test is not available for generalized linear models, such as the logit model considered here. In this study, we will discuss how technology developed for generalized additive models [19, 37] can be used to fit generalized linear and additive models with an ordinal smoothing penalty, and to conduct further statistical inference. In more detail, the rest of the paper is organized as follows. In Section 2, we will revisit generalized additive models, the ordinal penalty, and corresponding statistical inference. In Section 3, we will carry out some numerical experiments to investigate size and power of the test(s) described in Section 2 as well as coverage of confidence intervals. Furthermore, we will compare the accuracy of penalized estimates to well-established standard procedures typically used for data analysis with ordinal covariates. The BPD data will be analyzed in more detail in Section 4, and Section 5 concludes.

Given a response y with a distribution from a simple exponential family, and a set of covariates x_1, ..., x_p, a generalized additive model [19] has the form

η = α + f_1(x_1) + ... + f_p(x_p), µ = h(η),   (1)

where µ is the (conditional) mean of y given the covariates, h is a (known) response function, and η is comparable to the linear predictor in generalized linear models [26, 25]. The difference to a generalized linear model is that non-linear functions f_j, j = 1, ..., p, are allowed in η, but the structure of η is still additive. Of course, if the f_j are restricted to be linear, a generalized linear model is obtained as a special case. In a (generalized) additive model, however, it is usually only assumed that the functions f_j are reasonably smooth; and one way to fit such models, as for instance implemented in the popular R package mgcv [29, 37], is to specify a set of basis functions for each predictor and to employ an appropriate, quadratic smoothing penalty on the corresponding basis coefficients.
More precisely, we assume that

f_j(x) = Σ_{r=1}^{q_j} β_{jr} B_{jr}(x),   (2)

with B_{j1}(x), ..., B_{jq_j}(x) being the set of basis functions chosen for function f_j, and β_{j1}, ..., β_{jq_j} the corresponding basis coefficients. So by fitting the basis coefficients, the function f_j is fitted implicitly. For instance, a popular choice in case of a continuous x is the so-called B-spline basis, leading to f_j being modeled as a spline function; see, e.g., de Boor [5], Dierckx [6] or Eilers & Marx [7] for details on (B-)splines. The big advantage of the basis functions approach is that after plugging in the observed data x_ij, i = 1, ..., n, j = 1, ..., p, the vector f_j = (f_j(x_1j), ..., f_j(x_nj))ᵀ can be written as B_j β_j, with matrix (B_j)_{ir} = B_{jr}(x_ij) and vector β_j = (β_{j1}, ..., β_{jq_j})ᵀ. So model (1) can be written as a (generalized) linear model with coefficients β_{jr}, j = 1, ..., p, r = 1, ..., q_j, and the basis coefficients can be fitted accordingly, in particular by maximum likelihood/Fisher scoring.

The problem with this approach is that typically a rather large number of basis functions needs to be chosen to be sufficiently flexible with respect to the type of functions that can be fitted. With a large basis, however, the number of basis coefficients to be fitted becomes large, too. As a consequence, resulting functions tend to be wiggly and thus hard to interpret. Therefore, a penalty term J_j(β_j) is typically added for each covariate x_j, penalizing wiggly basis coefficients and thus wiggly functions f_j. When combining the B-spline basis with a quadratic difference penalty on adjacent β_{jr}, for instance, we obtain the so-called P-spline [7, 24, 8].

Now suppose we have a categorical predictor x_j with levels 1, ..., k_j. Then there is a somewhat natural basis: the basis of (dummy) functions (l = 1, ..., k_j)

B_{jl}(x_j) = 1 if x_j = l, 0 otherwise.   (3)
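As a small illustration (plain NumPy; the function and variable names are ours, not from the paper), the dummy basis (3) and the representation f_j = B_j β_j can be sketched as follows:

```python
import numpy as np

def dummy_basis(x, k):
    """n x k dummy basis matrix B with B[i, l-1] = 1 iff x[i] == l, cf. (3)."""
    x = np.asarray(x)
    B = np.zeros((len(x), k))
    B[np.arange(len(x)), x - 1] = 1.0
    return B

# Ordinal covariate with levels 1..4, observed on n = 6 samples.
x = np.array([1, 3, 2, 4, 4, 1])
B = dummy_basis(x, k=4)

# Evaluating f(x) = sum_l beta_l B_l(x) is a matrix-vector product, f = B beta.
beta = np.array([0.5, -0.2, 0.1, 0.9])
f = B @ beta
print(f)  # each entry equals beta[x_i - 1]
```

With a dummy basis, fitting the basis coefficients amounts to fitting one (constrained) level effect per category.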
Since we know that x_j can only take the values 1, ..., k_j, we do not need to think about the type and number of basis functions, placement of knots, etc., as we usually do with continuous covariates. For identifiability, however, some linear restrictions need to be placed on the basis/dummy coefficients β_{jl}. Typically, this is done by specifying a so-called reference category, e.g., category 1 for categorical predictor x_j, and setting the corresponding β_{j1} = 0. However, one may also set Σ_l β_{jl} = 0. In case of a continuous x_j, a popular constraint is

Σ_{i=1}^n f_j(x_ij) = 0 for all j,   (4)

which translates into Σ_l n_{jl} β_{jl} = 0 for categorical predictor x_j, with n_{jl} being the number of samples with level l observed for x_j. Since the latter is the typical constraint in generalized additive models, also used in mgcv, we will use (4) here as well (if not stated otherwise). After having fit the functions/basis coefficients, however, one can switch between constraints easily, because switching between the constraints mentioned above for some f_j yields an equivalent model: it simply means a vertical shift of the entire function f_j and a corresponding change in the constant α in (1).

If a (quadratic) difference penalty is now put on the basis/dummy coefficients, this gives exactly the smoothing penalty mentioned above [13, 32, 33]. More precisely, with β_{jl}, l = 1, ..., k_j, denoting the dummy coefficient of level l, the penalty primarily used is the first-order penalty

J_j(β_j) = Σ_{l=2}^{k_j} (β_{jl} − β_{j,l−1})²,   (5)

with β_j = (β_{j1}, ..., β_{jk_j})ᵀ. Alternatively, the second-order penalty

J_j(β_j) = Σ_{l=2}^{k_j−1} (β_{j,l+1} − 2β_{jl} + β_{j,l−1})²,   (6)

penalizing deviations from linearity, can be used as well [12]. The strength of the penalty put on β_j is determined by a parameter λ_j, which may differ across predictors x_j.
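Both penalties can equivalently be written as quadratic forms β_jᵀ Dᵀ D β_j with a first- or second-order difference matrix D; a minimal NumPy sketch with illustrative values and k_j = 5 levels:

```python
import numpy as np

def diff_matrix(k, order=1):
    """(k - order) x k matrix applying first differences `order` times."""
    D = np.eye(k)
    for _ in range(order):
        D = D[1:] - D[:-1]
    return D

beta = np.array([0.0, 0.3, 0.4, 1.2, 1.1])  # dummy coefficients, k = 5 levels

# First-order penalty (5) as a quadratic form.
D1 = diff_matrix(5, order=1)
J1 = float(beta @ D1.T @ D1 @ beta)
assert np.isclose(J1, np.sum(np.diff(beta) ** 2))

# Second-order penalty (6), penalizing deviations from linearity.
D2 = diff_matrix(5, order=2)
J2 = float(beta @ D2.T @ D2 @ beta)
assert np.isclose(J2, np.sum(np.diff(beta, n=2) ** 2))

# Coefficients lying exactly on a straight line are not penalized by (6).
print(np.allclose(D2 @ np.arange(5.0), 0.0))  # True
```

In the mixed-model reparameterization discussed below, matrices of this difference type define the random effects u_j = D β_j.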
Parameters can then be fitted by maximizing the penalized log-likelihood

l_p(β) = l(β) − Σ_{j=1}^p λ_j J_j(β_j),

where β denotes the vector collecting all basis coefficients and, if applicable, additional regression coefficients of covariates included parametrically, i.e., in terms of linear effects; l(β) is the usual, unpenalized log-likelihood. In case of λ_j = 0 for all j, the usual maximum likelihood estimates are obtained. If λ_j → ∞ in case of penalty (5), the fitted f_j(x) = 0 for all x ∈ {1, ..., k_j} is obtained because of the constraint (no matter which one is chosen from the options given above). If penalty (6) is chosen, a large λ_j leads to a function f_j that is linear in the class labels 1, ..., k_j.

One of the benefits of considering ordinal predictors along with penalties (5) and (6) in the framework of generalized additive models is that after implementing basis (3) in the appropriate way, gam() from mgcv can be used directly to fit a generalized linear/additive model with ordinal predictor(s) as needed for the BPD data. Besides pure model fitting, this provides us with additional tools; in particular, built-in estimation of the penalty/smoothing parameter(s) and further statistical inference, such as confidence intervals and checking the significance of smooth terms. A prerequisite for at least some of those tools, however, is to rewrite our model with quadratic smoothing penalty as a (generalized) linear mixed model.

Starting with penalty (5), we may rewrite our model in terms of β̃_j = D̃ β_j with

D̃ = [ 1  0 ... 0 ]
    [      D     ],

where D is the (k_j − 1) × k_j first-order difference matrix

D = [ −1  1  0 ...  0 ]
    [  0 −1  1 ...  0 ]
    [        ...      ]
    [  0 ...  0 −1  1 ].

We denote the last k_j − 1 elements of β̃_j by u_j = (u_{j1}, ..., u_{j,k_j−1})ᵀ, such that u_{jl} = β_{j,l+1} − β_{jl} and u_j = D β_j, with D from above. The entries of u_j are now interpreted as iid random effects with u_{jl} ∼ N(0, τ_j²) [12, 10, 31]. Then, for given variance parameters τ_j², j = 1, . . .
, p (note, the random effects' variance may vary between covariates), maximizing the log-likelihood over β̃_j yields estimates that are equivalent to the smoothed dummies obtained via penalty (5), with a one-to-one correspondence between the penalty parameter λ_j and the variance parameter τ_j². Alternatively, smoothed dummy coefficients can be derived in a Bayesian framework (as the mode of the posterior density) by putting a Gaussian random walk prior (with prior variance τ_j²) on the dummy coefficients [13]. Analogously, penalty (6) can be derived/interpreted in a mixed model/Bayesian framework. For that purpose, we replace D̃ above by

D̃₂ = [ 1 0 0 ... 0 ]
     [ 0 1 0 ... 0 ]
     [     D₂      ],

where D₂ is the (k_j − 2) × k_j second-order difference matrix

D₂ = [ 1 −2  1  0 ...  0 ]
     [ 0  1 −2  1 ...  0 ]
     [         ...       ]
     [ 0 ...  0  1 −2  1 ].

Then, the last k_j − 2 elements of β̃_j are denoted by u_j = (u_{j1}, ..., u_{j,k_j−2})ᵀ, such that u_{jl} = β_{j,l+2} − 2β_{j,l+1} + β_{jl}, and as before u_{jl} ∼ N(0, τ_j²) are interpreted as iid random effects [12, 37].

Both approaches, the mixed model and the Bayesian interpretation, can be used for determining the variance components τ_j², and thereby the penalty parameters λ_j. In theory, we may integrate out the random effects from the joint density of the response and random effects, giving the marginal likelihood of the fixed effects and the variance parameters. Maximizing this likelihood leads to ML estimates of the fixed effects and variance parameters. In generalized linear mixed models, however, calculating the integral analytically is often not possible, and it is also numerically demanding. The standard approach is the so-called Laplace approximation [1], which essentially cycles through the penalized log-likelihood given above, plugging in the corresponding estimates of the regression/basis coefficients to obtain an approximate profile likelihood for the variance parameters that can be maximized. In the Bayesian framework, the smoothed dummies are then interpreted as an empirical Bayes estimator, since the τ_j² are estimated from the data. In a fully Bayesian approach, we could choose a hyper-prior, e.g., an inverse Gamma, for τ_j² and apply Markov chain Monte Carlo (MCMC); but we will focus on the mixed model/empirical Bayes approach here.

A problem with ML estimation of variance parameters is that those estimates are typically biased downwards, that is, the true variance tends to be underestimated, in particular if the number of fixed effects is large. As an alternative to ML which reduces this bias, so-called restricted maximum likelihood (REML) estimation has been proposed, which can be motivated in different ways [28, 17, 18, 21, 9]. Eventually, in the linear mixed model, the (profile) log-likelihood is (additively) corrected such that the number/structure of fixed effects is taken into account. In generalized mixed models, this can be done analogously within the Laplace approximation. It should be noted that REML cannot be used to compare models with different fixed effects structures. Nevertheless, REML is very popular for estimating variance components in mixed models, due to the reduced bias, and hence for determining smoothing/penalty parameters in (generalized) additive models as well [35]. Besides likelihood-based methods, prediction error methods such as (generalized) cross-validation [34] or the AIC [38] can also be used for smoothness selection.

In particular, the Bayesian interpretation of quadratic smoothing penalties is useful to derive confidence intervals. In the Gaussian identity link/linear mixed model case, we can derive the covariance matrix of the regression/basis coefficients' posterior distribution. In the generalized case, the corresponding matrix from the penalized iteratively re-weighted least squares (PIRLS) algorithm used to estimate the parameters is taken. Using this matrix, say V_β, we can calculate (point-wise) credible intervals for the function f̂_j = B_j β̂_j, denoting the vector of f̂_j(x) values at the evaluation points x_ij.
For each element f̂_ji of f̂_j, we obtain an approximate (1 − α)·100% credible interval via f̂_ji ± z_{1−α/2} √v_ji, where v_ji is the i-th diagonal element of B_j V_β B_jᵀ, and z_q is the q-quantile of the standard normal distribution. It turned out that those credible regions also have good frequentist coverage rates in generalized additive models with continuous covariates [27, 23]. In Section 3, we will investigate whether this is also the case with smoothed ordinal effects.

Closely related to the confidence intervals described above is testing the significance of smooth terms. Specifically, we would like to test the null hypothesis H_0: f_j(x) = 0 for all potential values x of predictor x_j; with ordinal predictors, this means for all levels of the predictor of interest. Analogously to the confidence intervals above, let f̂_j = B_j β̂_j denote the vector of f̂_j(x) evaluated at the predictor levels (i.e., with appropriately chosen dummy matrix B_j). Then, following [36, 37], we can define a Wald-type test statistic

T = f̂_jᵀ V_j⁻ f̂_j,

where V_j⁻ is an appropriately chosen pseudo-inverse of V_j = B_j V_β B_jᵀ. Under H_0, T (approximately) follows a mixture of χ²-distributions. For the concrete form of this mixture and of V_j⁻, see Wood [36, 37]. When taking the mixed models perspective on penalty (5), the null hypothesis above can alternatively be written as H_0: τ_j² = 0. In the Gaussian/linear mixed model with only one smooth term/variance component, a (restricted) likelihood ratio test can be used [3, 4, 30], which also works well with ordinal data [10, 12, 31], and extensions/approximations exist in case of multiple ordinal predictors/smooth terms [10, 15, 30]. To the best of our knowledge, however, no generalization exists beyond models with Gaussian response.
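A minimal sketch of both constructions (NumPy/SciPy; V_β is a made-up stand-in for a fitted posterior covariance, and we use a plain χ² with the rank of V_j as degrees of freedom instead of the χ² mixture used by mgcv):

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(1)

k = 4                                    # levels of the ordinal predictor
B = np.eye(k)                            # dummy basis evaluated at each level
f_hat = np.array([0.2, -0.1, 0.4, 0.6])  # illustrative fitted level effects

# Stand-in for the posterior covariance V_beta of the basis coefficients.
A = rng.normal(size=(k, k))
V_beta = A @ A.T + 0.1 * np.eye(k)

V = B @ V_beta @ B.T
v = np.diag(V)

# Point-wise (1 - alpha)*100% credible intervals: f_hat +/- z_{1-alpha/2} sqrt(v).
alpha = 0.05
z = norm.ppf(1 - alpha / 2)
lower, upper = f_hat - z * np.sqrt(v), f_hat + z * np.sqrt(v)

# Wald-type statistic T = f_hat' V^- f_hat, with a pseudo-inverse of V.
T = float(f_hat @ np.linalg.pinv(V) @ f_hat)
p_value = float(chi2.sf(T, df=np.linalg.matrix_rank(V)))
print(T, p_value)
```

In practice, V_β, the effective degrees of freedom, and the χ² mixture all come from the fitted penalized model; the sketch only shows the arithmetic of the interval and the test statistic.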
Therefore, we will focus on the Wald-type test here. For instance, when looking at summary.gam() with ordinal predictor x denoting the (first) week of overall oral bacteria detection, and correcting for various (known) risk factors of BPD and/or potential confounders, we have:

    Family: binomial
    Link function: logit

    Formula:
    y ~ s(weight, bs = "cr") + SGAsym + sex + mult + s(steroid, bs = "cr") +
        s(anti, bs = "cr") + s(x, bs = "ordinal", m = 2)

    Parametric coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -2.8377     0.8537  -3.324 0.000888 ***
    SGAsym        2.0615     1.2461   1.654 0.098061 .
    sex           1.8593     0.7143   2.603 0.009242 **
    mult          0.5417     0.4102   1.321 0.186584
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Approximate significance of smooth terms:
               edf Ref.df Chi.sq  p-value
    s(weight)    1      1 13.151 0.000287 ***
    s(steroid)   1      1  4.527 0.033364 *
    s(anti)      1      1  4.675 0.030600 *
    s(x)         1      1  5.627 0.017682 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Figure 2: QQ-plots of observed p-values against uniform quantiles for the first-order (m = 1) and second-order (m = 2) penalty (red/blue), for gram-negative, gram-positive, and pathogenic bacteria.]
    R-sq.(adj) = 0.438   Deviance explained = 42%
    -REML = 30.444   Scale est. = 1   n = 101
Here,
SGAsym is an indicator for 'Small for Gestational Age', defined as birth weight, length, and head circumference below the 10th percentile according to the percentiles from the German perinatal registry; sex is an indicator for male; and mult denotes multiples. The child's weight at birth is denoted by weight, steroid is the number of days (before delivery) steroids were given (to the mother), and anti the number of days (after birth) antibiotics were given (to the child). We see that all the smooth effects are eventually fitted as linear (as the effective degrees of freedom/edf equal 1), and are indicated as significant at the 5% level (at least), which confirms results obtained earlier for this dataset [22]. In fact, the p-values given above are (virtually) identical to the p-values when fitting a standard logit model with linear effects (e.g., using the R function glm(), not shown here), indicating the negative effect of x, i.e., the risk of BPD is increased if overall bacterial colonization is detected early. In general, however, it is not yet clear whether the p-values provided in the generalized additive model are reliable for the ordinal smoothing penalty. So this needs to be investigated before using the ordinal smoothing approach for analyzing the effects of specific types of bacteria.

For investigating whether the p-values provided by summary.gam() are reliable when using the ordinal basis/smoothing penalty, we will discuss the results of some simulation studies, starting with the BPD data.
We use the confounder model, i.e., the model with the information on bacteria removed, to estimate BPD probabilities. That means the null hypothesis that the effect of the (ordinal predictor) x is zero is true by construction, because the fitted BPD probabilities do not depend on x, given the other covariates. Using those probabilities, we simulate 'new' BPD response data, fit the model with the smooth ordinal x added (and the smoothing parameter estimated by REML), and record the p-values of x. For x, we use gram-negative/positive and pathogenic bacteria, respectively (compare Figure 1).

[Figure 3: Size and power for various tests and settings at two significance levels α (solid/dashed lines); the top row refers to the near-straight-line, the bottom row to the non-monotone scenario. Left panels show the true functions f(x); center and right panels show rejection rates for the Gaussian/identity and binary/logit case, respectively, for ordinal, linear, and factor modeling.]

For each x, this is done 1,000 times; Figure 2 shows corresponding QQ-plots of the p-values observed when employing the first- and second-order penalty, respectively. Since the distribution of small p-values is particularly relevant when testing with the usual α ≤ 0.1, we restrict plotting to that area. It is seen that the p-values obtained when employing the first-order penalty are typically too small. The problems with the first-order penalty can be explained by the fact that the null space of the corresponding smooth term has dimension zero (compare the mgcv manual). In other words, the null hypothesis in the framework of mixed models (which is used for estimation here), a zero variance component, is on the boundary of the parameter space, which means that standard theory does not apply [3, 4].
The results for the second-order penalty, by contrast, look very encouraging. As a consequence of these findings on the BPD data, we will only consider the second-order penalty when further investigating size and power below.
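The logic of such a null simulation can be mimicked in a simplified Gaussian analogue (illustrative only; the paper's actual study uses the fitted logit confounder model and REML-based smooth tests): if a test is calibrated, p-values for an effect that is zero by construction should be approximately uniform.

```python
import numpy as np
from scipy.stats import f as f_dist, kstest

rng = np.random.default_rng(7)
n, k, n_sim = 100, 6, 500

def factor_test_pvalue(y, X0, X1):
    """F-test of the dummy columns added in X1 over the null design X0."""
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    df1 = X1.shape[1] - X0.shape[1]
    df2 = len(y) - X1.shape[1]
    F = ((rss(X0) - rss(X1)) / df1) / (rss(X1) / df2)
    return f_dist.sf(F, df1, df2)

pvals = []
for _ in range(n_sim):
    z = rng.uniform(size=n)
    x = rng.integers(1, k + 1, size=n)        # ordinal predictor, no true effect
    y = 3 * np.sqrt(z) + rng.normal(size=n)   # null model: y depends on z only
    X0 = np.column_stack([np.ones(n), np.sqrt(z)])
    dummies = (x[:, None] == np.arange(2, k + 1)).astype(float)
    X1 = np.column_stack([X0, dummies])
    pvals.append(factor_test_pvalue(y, X0, X1))

# Under the null, a calibrated test yields (approximately) uniform p-values.
print(round(float(kstest(pvals, "uniform").pvalue), 3))
```

The BPD study applies the same idea with penalized smooth terms, where calibration is exactly what is in question near the boundary of the parameter space.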
Figure 4: Coverage observed for (nominally) 95% confidence intervals when using the first-order (yellow) and second-order (blue) penalty in different simulation settings.
Given a continuous covariate z ∼ U(0, 1) and an ordinal predictor x uniformly sampled from {1, ..., 6}, we assume

• a Gaussian distribution for y with µ = 3√z + ψ f(x) and standard normal errors,
• a logit model with linear predictor η = −√z + ψ f(x).

In both cases, the parameter ψ ≥ 0 determines the effect size of x. For f, we consider the two functions given in Figure 3 (left). Furthermore, size and power over 10,000 replicates with sample size 100 each are shown for the Gaussian/identity (center) and binary/logit case (right), at the two significance levels α (solid and dashed lines), and for three different tests: besides the smooth effect (black), we consider including x as a parametric/linear term (green) or as a nominal factor (blue). It is seen that the ordinal smoothing penalty adapts well to the situations considered. On the one hand, if the true, underlying function is close to linear (center), the corresponding test behaves almost identically to the parametric/linear specification. On the other hand, in the very non-linear/non-monotone case, its power is similar to unpenalized nominal/factor modeling. Although the latter seems to be better in the non-monotone, binary/logit case (bottom right), it also appears that the analysis of deviance/GLRT used here produces a rejection rate for effect size zero that is somewhat too high in the logit model (i.e., p-values are too small). In summary, our studies indicate that the ordinal penalty works well in generalized additive models (with smoothing parameters estimated by REML), and testing as provided by summary.gam() in mgcv is safe for the second-order penalty, with power highly competitive to both linear and ordinary factor modeling.

We consider the same settings as in Section 3.2, with fixed values of ψ in the Gaussian and logit case, respectively.
For comparing the estimated to the true functions/coefficients (see Figure 3, left), the estimated functions/coefficients are centered such that Σ_l β̂_l = Σ_l f̂(l) = 0 (which is also the case for the true f's in Figure 3). Figure 4 shows the coverage observed over 1,000 replicates for (point-wise) confidence intervals with a nominal level of 95%. Here we consider both the first- and the second-order penalty. Although the tests from Section 3.1 are not reliable for the first-order penalty, this penalty may still be used for model fitting, and confidence intervals provide important information on the estimates' variability. Indeed, as seen from Figure 4, the intervals work reasonably well in both the normal and the logit case with non-monotone coefficients. If coefficients are close to being linear, coverage (on average) is also close to 95% when using the first-order penalty. The second-order penalty, however, leads to substantial under-coverage in the 'near straight line' case. This problem is also found for (generalized) additive models with continuous covariates if deviations from linearity are penalized and the fitted regression function is close to being linear. The suggested fix is to change the target of inference to the smooth term plus the overall model intercept [23, 37].

[Figure 5: Squared errors in the Gaussian scenarios (near straight line / non-monotone) for the first- and second-order penalty (i.e., m = 1 and m = 2, respectively), linear and ordinary factor modeling, by predictor level 1–6; a couple of more extreme outliers are not shown, in particular for ordinary factor modeling.]

Figures 5 and 6 show graphical summaries of the squared errors when using the first- and second-order penalty (i.e., m = 1 and m = 2, respectively), linear and ordinary factor modeling. We see that both m = 1 and m = 2 produce highly competitive results, with the second-order penalty being the clear overall winner. If the true coefficients lie nearly on a straight line (left panel), assuming linearity over the factor levels also produces good results, which is of course not surprising. In case of the very non-monotone coefficients (right panel), however, the assumption of linearity is highly misleading. Ordinary factor modeling generally suffers from very large variance compared to the penalized estimates, leading to a much larger mean squared error.
This is particularly problematic with binary response data (logit model), because sometimes there are only 1s or 0s at some predictor levels, leading to an inflation of the corresponding coefficients; that is, coefficients tend towards ±∞. Those very extreme outliers are not even shown in Figure 6. Of course, those problems may vanish when increasing the sample size
and/or changing the coefficients used to generate the data. With the BPD data, however, we only have a moderate sample size (n ≈ ) and even categories containing only BPD cases or only controls (see Figure 1). So we focus on comparable settings here.

Figure 6: Squared errors in the logit scenarios for the first- and second-order penalty (i.e., m = 1 and m = 2, respectively), linear and ordinary factor modeling; a couple of more extreme outliers are not shown, in particular for ordinary factor modeling.

In addition to the model with predictor 'overall oral bacterial colonization' as given on page 6, which resulted in a generalized but otherwise purely linear model (due to the 'smooth' terms being fitted as linear), we also fitted a more detailed model with bacterial colonization categorized into 'gram negative', 'gram positive', and 'pathogenic'. As before, the corresponding ordinal predictor gives the week in which colonization by the respective type of (oral) bacteria was detected for the first time. In the original publication [22], separate models were fit for each type of bacteria, and the first two weeks were collapsed to make (unpenalized) model fitting with dummy-coded ordinal factors feasible. Thanks to the penalties presented here, we are now able to include all three predictors jointly while using all information at the resolution available.

Table 1 (left) shows the results for the parametric terms. In particular, it is confirmed that low birth weight is a risk factor for BPD, and that male infants and multiples also have an increased risk of developing BPD. Antenatal steroids, by contrast, may decrease the risk. Results for the different types of bacterial colonization, which are included as ordinal predictors with smooth effects, are given in Table 2 and Figure 7. We see that the only significant effect is detected for pathogenic bacteria. The fitted function indicates that early detection is associated with an increased risk of BPD.
Statistical uncertainty, however, is very large (due to the small number of samples in category/level 1), as indicated by the confidence interval.

Table 1: Results for the parametric terms in the full and reduced model.

                 Full model                     Reduced model
Covariate       Est.   S.E.  z-val.  p-val.    Est.   S.E.  z-val.  p-val.
(Intercept)    6.214  2.630   2.363   0.018   6.426  2.170   2.961   0.003
[remaining rows, including Weight (g), not recoverable from the extracted text]

We showed how ordinal predictors in generalized linear models, fitted using quadratic difference penalties as proposed in the literature, can be incorporated and interpreted in the framework of generalized additive models, making tools for statistical inference developed there available for ordinal predictors as well. Here we focused on fitting itself with built-in selection of penalty parameters, and on further statistical inference in terms of (pointwise) confidence intervals and testing significance of smooth terms, i.e., significance of the ordinal predictors. In particular, we confirmed that at least some of the tests provided in the popular mgcv
R package can be used with ordinal predictors, too. More specifically, this is the case for the Wald-type tests when employing the second-order difference penalty on adjacent dummy coefficients. It should be pointed out, however, that this holds in particular if smoothing parameters are estimated by REML (or ML), which is what has been investigated in this study. When using GCV (which is the default in mgcv), this is not necessarily the case.

Besides testing in a pre-defined model as done here, those tests may also be used for stepwise (backward/forward) variable selection, which is still quite popular among applied researchers, including but not
limited to medicine. However, the methods discussed here are primarily useful if the number of (ordinal) covariates is moderate. In high-dimensional problems, that is, if the number of covariates becomes large, techniques other than statistical testing may be used for model selection, for instance, sparsity-inducing penalties such as the group lasso [39, 11].

Table 2: Results for smooth terms in the full and reduced model when using the second-order ordinal smoothing penalty.

                    Full model                       Reduced model
Predictor        edf  Ref.df  Chi.sq  p-val.     edf  Ref.df  Chi.sq  p-val.
Gram negative  1.000   1.000   1.307   0.253       –       –       –       –
Gram positive  3.708   4.393   5.227   0.264       –       –       –       –
Pathogenic     5.258   5.831  13.711   0.030   4.973   5.696  13.573   0.027

Figure 7: Fitted coefficients of the ordinal predictors together with pointwise 95% confidence intervals in the full model with second-order penalties.

Add-on functions implementing the ordinal basis for use within mgcv are available on request and will be made publicly available through the R package ordPens (which is currently under revision). After sourcing the respective functions (or installing and loading ordPens, respectively), the gam() function from mgcv can be used with smooth terms s(..., bs = "ordinal").

In summary, using the ordinal smoothing penalty within generalized additive models offers a very convenient and flexible way to respect and exploit the information provided by ordinally scaled predictors, in a sound statistical framework with well-established tools for statistical inference. Built-in selection of tuning parameters yields estimates that adapt very well to the data structure, with accuracy at least competitive with, if not better than, standard linear/factor modeling.
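To sketch how such a penalty enters estimation for a single ordinal predictor with one-hot dummy coding Z, the penalized least-squares estimate takes the ridge-type form β̂ = (Z'Z + λ D_m' D_m)⁻¹ Z'y, where D_m is the m-th order difference matrix. The following pure-Python illustration uses hypothetical toy data; it is not the BPD data and not the authors' mgcv/ordPens implementation:

```python
# Illustrative sketch: penalized least-squares fit of dummy coefficients
# for one ordinal predictor, using the quadratic difference penalty
# lambda * beta' D' D beta. Pure Python, no external packages.

def diff_matrix(k, m):
    """m-th order difference matrix D (shape (k - m) x k)."""
    D = [[float(j == i + 1) - float(j == i) for j in range(k)]
         for i in range(k - 1)]
    for _ in range(m - 1):
        D = [[D[i + 1][j] - D[i][j] for j in range(k)]
             for i in range(len(D) - 1)]
    return D

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j]
                              for j in range(i + 1, n))) / M[i][i]
    return x

def penalized_fit(y_by_level, lam, m=2):
    """beta_hat = (Z'Z + lam * D'D)^{-1} Z'y for one-hot dummy coding."""
    k = len(y_by_level)
    D = diff_matrix(k, m)
    # Z'Z is diagonal with the level counts; Z'y holds the level sums.
    A = [[lam * sum(D[r][i] * D[r][j] for r in range(len(D)))
          for j in range(k)] for i in range(k)]
    for i in range(k):
        A[i][i] += len(y_by_level[i])
    b = [sum(ys) for ys in y_by_level]
    return solve(A, b)

# Observations grouped by ordinal level (hypothetical toy data):
y = [[0.9, 1.1], [2.8, 3.2], [1.9, 2.1], [3.9, 4.1]]
print(penalized_fit(y, lam=0.0))  # level means (unpenalized fit)
print(penalized_fit(y, lam=1e8))  # nearly a straight line over levels
```

With λ = 0 the estimate reduces to the per-level means, i.e., ordinary factor modeling; with large λ and m = 2 the second differences of the fitted coefficients are shrunk towards zero, so the fit approaches a straight line over the levels, mirroring the behavior discussed above.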
Author contributions
Jan Gertheiss conducted the data analysis/numerical experiments and wrote the manuscript. Fabian Scheipl implemented the method for use within mgcv. Tina Lauer and Harald Ehrhardt collected the data and were involved in the discussion of the manuscript.
Funding
This work was supported in part by Deutsche Forschungsgemeinschaft (DFG) under Grant GE2353/2-1.
References

[1] N E Breslow and D G Clayton. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88:9–25, 1993.
[2] A Cieza, C Oberhauser, J Bickenbach, S Chatterji, and G Stucki. Towards a minimal generic set of domains of functioning and health. BMC Public Health, 14:218, 2014.
[3] C M Crainiceanu and D Ruppert. Likelihood ratio tests in linear mixed models with one variance component. Journal of the Royal Statistical Society B, 66:165–185, 2004.
[4] C M Crainiceanu, D Ruppert, G Claeskens, and M P Wand. Exact likelihood ratio tests for penalized splines. Biometrika, 92:91–103, 2005.
[5] C de Boor. A Practical Guide to Splines. Springer, New York, 1978.
[6] P Dierckx. Curve and Surface Fitting with Splines. Clarendon Press, Oxford, 1993.
[7] P H C Eilers and B D Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11:89–121, 1996.
[8] P H C Eilers and B D Marx. Generalized linear additive smooth structures. Journal of Computational and Graphical Statistics, 11:758–783, 2002.
[9] L Fahrmeir, T Kneib, S Lang, and B Marx. Regression – Models, Methods and Applications. Springer, Berlin/Heidelberg, 2013.
[10] J Gertheiss. ANOVA for factors with ordered levels. Journal of Agricultural, Biological, and Environmental Statistics, 19:258–277, 2014.
[11] J Gertheiss, S Hogger, C Oberhauser, and G Tutz. Selection of ordinally scaled independent variables with applications to international classification of functioning core sets. Journal of the Royal Statistical Society C, 60:377–395, 2011.
[12] J Gertheiss and F Oehrlein. Testing relevance and linearity of ordinal predictors. Electronic Journal of Statistics, 5:1935–1959, 2011.
[13] J Gertheiss and G Tutz. Penalized regression with ordinal predictors. International Statistical Review, 77:345–365, 2009.
[14] S M Glass and S E Ross. Modified functional movement screening as a predictor of tactical performance potential in recreationally active adults. The International Journal of Sports Physical Therapy, 10:612–621, 2015.
[15] S Greven, C M Crainiceanu, H Küchenhoff, and A Peters. Restricted likelihood ratio testing for zero variance components in linear mixed models. Journal of Computational and Graphical Statistics, 17:870–891, 2008.
[16] J Gronbach, T Shahzad, S Radajewski, C-M Chao, S Bellusci, R E Morty, T Reicherzer, and H Ehrhardt. The potentials and caveats of mesenchymal stromal cell-based therapies in the preterm infant. Stem Cells International, Article ID 9652897, 2018.
[17] D A Harville. Bayesian inference for variance components using only error contrasts. Biometrika, 61:383–385, 1974.
[18] D A Harville. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72:320–338, 1977.
[19] T Hastie and R Tibshirani. Generalized Additive Models. Chapman & Hall, London, 1990.
[20] A E Hoerl and R W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
[21] N M Laird and J H Ware. Random-effects models for longitudinal data. Biometrics, 38:963–974, 1982.
[22] T Lauer, J Behnke, F Oehmke, J Bäcker, K Gentil, T Chakraborty, M Schloter, J Gertheiss, and H Ehrhardt. Bacterial colonization within the first six weeks of life and pulmonary outcome in preterm infants <1000 g. Journal of Clinical Medicine, 9:2240, 2020.
[23] G Marra and S N Wood. Coverage properties of confidence intervals for generalized additive model components. Scandinavian Journal of Statistics, 39:53–74, 2012.
[24] B D Marx and P H C Eilers. Direct generalized additive modelling with penalized likelihood. Computational Statistics & Data Analysis, 28:193–209, 1998.
[25] P McCullagh and J A Nelder. Generalized Linear Models. Chapman & Hall, New York, 2nd edition, 1989.
[26] J A Nelder and R W M Wedderburn. Generalized linear models. Journal of the Royal Statistical Society A, 135:370–384, 1972.
[27] D Nychka. Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83:1134–1143, 1988.
[28] H D Patterson and R Thompson. Recovery of inter-block information when block sizes are unequal. Biometrika, 58:545–554, 1971.
[29] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.
[30] F Scheipl, S Greven, and H Küchenhoff. Size and power of tests for a zero random effect variance or polynomial regression in additive and linear mixed models. Computational Statistics & Data Analysis, 52:3283–3299, 2008.
[31] E Sweeney, C Crainiceanu, and J Gertheiss. Testing differentially expressed genes in dose-response studies and with ordinal phenotypes. Statistical Applications in Genetics and Molecular Biology, 15:213–235, 2016.
[32] G Tutz and J Gertheiss. Rating scales as predictors – the old question of scale level and some answers. Psychometrika, 79:357–376, 2014.
[33] G Tutz and J Gertheiss. Regularized regression for categorical data (with discussion). Statistical Modelling, 16:161–200, 2016.
[34] S N Wood. Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society B, 70:495–518, 2008.
[35] S N Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society B, 73:3–36, 2011.
[36] S N Wood. On p-values for smooth components of an extended generalized additive model. Biometrika, 100:221–228, 2013.
[37] S N Wood. Generalized Additive Models: An Introduction with R. CRC Press, Boca Raton, 2nd edition, 2017.
[38] S N Wood, N Pya, and B Saefken. Smoothing parameter and model selection for general smooth models (with discussion). Journal of the American Statistical Association, 111:1548–1575, 2016.
[39] M Yuan and Y Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68:49–67, 2006.