Adaptive Step-Length Selection in Gradient Boosting for Generalized Additive Models for Location, Scale and Shape
Boyao Zhang*, Tobias Hepp, Sonja Greven, and Elisabeth Bergherr

Department of Medical Informatics, Biometry and Epidemiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Waldstrasse 6, 91054 Erlangen, Germany
Chair of Statistics, School of Business and Economics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
* Corresponding author e-mail: [email protected], Phone: +49-(0)9131-85-22729, FAX: +49-(0)9131-85-25740
February 19, 2021
Abstract
Tuning of model-based boosting algorithms relies mainly on the number of iterations, while the step-length is fixed at a predefined value. For complex models with several predictors, such as Generalized Additive Models for Location, Scale and Shape (GAMLSS), imbalanced updates of predictors, where some distribution parameters are updated more frequently than others, can be a problem that prevents some submodels from being appropriately fitted within a limited number of boosting iterations. We propose an approach using adaptive step-length (ASL) determination within a non-cyclical boosting algorithm for GAMLSS to prevent such imbalance. Moreover, for the important special case of the Gaussian distribution, we discuss properties of the ASL and derive a semi-analytical form of the ASL that avoids manual selection of the search interval and numerical optimization to find the optimal step-length, and consequently improves computational efficiency. We show competitive behavior of the proposed approaches compared to penalized maximum likelihood and boosting with a fixed step-length for GAMLSS models in two simulations and two applications, in particular for cases of large variance and/or more variables than observations. In addition, the idea of the ASL is also applicable to other models with more than one predictor, such as zero-inflated count models, and yields insights into the choice of reasonable defaults for the step-length in the simpler special case of (Gaussian) additive models.
Keywords— Step-Length, Gradient Boosting, GAMLSS, Variable Selection, Shrinkage
1 Introduction

Generalized additive models for location, scale and shape (GAMLSS) [Rigby and Stasinopoulos (2005)] are distribution-based approaches in which all parameters of the assumed distribution for the response can be modelled as additive functions of the explanatory variables [Ripley (2004); Stasinopoulos et al. (2017)]. Specifically, the GAMLSS framework allows the conditional distribution of the response variable to come from a wide variety of discrete, continuous and mixed discrete-continuous distributions, see Stasinopoulos and Rigby (2007). Unlike conventional generalized additive models (GAMs), GAMLSS not only model the location parameter, e.g. the mean for Gaussian distributions, but also further distribution parameters such as scale (variance) and shape (skewness and kurtosis) through the explanatory variables in linear, nonlinear or smooth functional form.

The coefficients of GAMLSS are usually estimated with a penalized maximum likelihood method [Rigby and Stasinopoulos (2005)]. However, this approach cannot deal with high-dimensional data, or more precisely, the case of more variables than observations [Bühlmann (2006)]. As the selection of informative covariates is an important part of practical analysis, Mayr et al. (2012) combined the GAMLSS framework with componentwise gradient boosting [Bühlmann and Yu (2003); Hofner et al. (2014); Hothorn et al. (2018)] such that variable selection and estimation can be performed simultaneously. The original method cyclically updates the distribution parameters, i.e. all predictors are updated sequentially in each boosting iteration [Hofner et al. (2016)]. Because the levels of complexity vary across the prediction functions, separate stopping values are required for each distribution parameter. Consequently, these stopping values have to be optimized jointly, as they are not independent of each other. The commonly applied joint optimization methods like grid search are, however, computationally very demanding. For this reason, Thomas et al. (2018) proposed an alternative non-cyclical algorithm that updates only one distribution parameter (the one yielding the strongest improvement) in each boosting iteration. This way, only one global stopping value is needed, and the resulting one-dimensional optimization procedure vastly reduces the computing complexity of the boosting algorithm compared to the previous multi-dimensional one. The non-cyclical algorithm can be combined with stability selection [Meinshausen and Bühlmann (2010); Hofner et al. (2015)] to further reduce the selection of false positives [Hothorn et al. (2010)].

In contrast to the cyclical approach, the non-cyclical algorithm avoids an equal number of updates for all distribution parameters, as it is not useful to artificially enforce updates for parameters with a less complex structure than other parameters. However, it becomes even more important to fairly select the predictor to be updated in any given iteration. The current implementation of Thomas et al. (2018), however, uses fixed and equal step-lengths for all updates, regardless of the achieved loss reduction of the different distribution parameters. As we demonstrate later, this leads to imbalanced updates that affect the fair selection, and some predictors still tend to be underfitted even after a large number of boosting iterations. This seems inconsistent, since one expects an underfitted predictor to be updated within a small number of iterations.
As we show later, a large $\sigma$ in a Gaussian distribution leads to a small negative gradient for $\mu$, and consequently the improvement for $\mu$ with a fixed small step-length in each boosting iteration will also be small. As a result, the algorithm needs a lot of updates for $\mu$ until its empirical risk decreases to the level of $\sigma$. However, the algorithm may stop long before the corresponding coefficients are well estimated.

We address this problem by proposing a version of the non-cyclical boosting algorithm for GAMLSS that adaptively and automatically optimizes the step-lengths of all predictors in each boosting iteration. The new approach leads to a fair selection of the predictor to update. While it does not enforce equal numbers of updates for all distribution parameters, it yields a natural balance in the updates. For the Gaussian distribution, we also derive (semi-)analytical adaptive step-lengths that reduce the need for numerical optimization, and discuss their properties. Our findings have implications beyond boosted GAMLSS models for boosting other models with several predictors, e.g. zero-inflated count models, and also give insights into the step-length choice for the simpler special case of (Gaussian) additive models.

The structure of this paper is organized as follows: Section 2 introduces the boosted GAMLSS models, including the cyclical and non-cyclical algorithms. Section 3 discusses how to apply the adaptive step-length in the non-cyclical boosted GAMLSS algorithm, and introduces the semi-analytical solutions of the adaptive step-length for the Gaussian distribution together with their properties. Section 4 evaluates the performance of the adaptive algorithms and the problem of the fixed step-length in two simulations. Section 5 presents the application of the adaptive algorithms to two datasets: the malnutrition data, where the outcome variance is very large, and the riboflavin data, which has more variables than observations. Section 6 concludes with a summary and discussion. Further relevant materials and results are included in the appendix.

2 Boosted GAMLSS

In this section, we briefly introduce the GAMLSS models and the two, cyclical and non-cyclical, boosting methods for estimation.

2.1 GAMLSS and componentwise gradient boosting
Conventional generalized additive models (GAM) assume a dependence only of the conditional mean $\mu$ of the response on the covariates. GAMLSS, however, also model other distribution parameters such as the scale $\sigma$, skewness $\nu$ and/or kurtosis $\tau$ with a set of statistical models. The $K$ distribution parameters $\theta^T = (\theta_1, \theta_2, \cdots, \theta_K)$ of a density function $f(y\,|\,\theta)$ are modelled by a set of up to $K$ additive models. The model class assumes that the observations $y_i$ for $i \in \{1, \cdots, n\}$ are conditionally independent given a set of explanatory variables. Let $y^T = (y_1, y_2, \cdots, y_n)$ be the vector of the response variable and $X$ an $n \times J$ data matrix. In addition, we denote by $X_{i\cdot}$, $X_{\cdot j}$ and $X_{ij}$ the $i$-th observation (vector of length $J$), the $j$-th variable (vector of length $n$) and the $i$-th observation of the $j$-th variable (a single value), respectively. Let $g_k(\cdot)$, $k = 1, \cdots, K$ be known monotonic link functions that relate the $K$ distribution parameters to the explanatory variables through additive models given by

$g_k(\theta_k) = \eta_{\theta_k}(X) = \beta_{0,\theta_k}\mathbf{1}_n + \sum_{j=1}^{J} f_{j,\theta_k}(X_{\cdot j}\,|\,\beta_{j,\theta_k}), \quad k = 1, \ldots, K,$   (1)

where $\theta_k = (\theta_{k,1}, \cdots, \theta_{k,n})^T$ contains the $n$ parameter values for the $n$ observations and functions are applied elementwise if the argument is a vector, $\eta_{\theta_k}$ is a vector of length $n$, $\mathbf{1}_n$ is a vector of ones and $\beta_{0,\theta_k}$ is the model parameter specific intercept. The function $f_{j,\theta_k}(X_{\cdot j}\,|\,\beta_{j,\theta_k})$ describes the effect of the $j$-th explanatory variable $X_{\cdot j}$ (vector of length $n$) on the model parameter $\theta_k$, and $\beta_{j,\theta_k}$ is the parameter of the additive predictor $f_{j,\theta_k}(\cdot)$. Various types of effects (e.g. linear, smooth, random) for $f(\cdot)$ are allowed. If the location parameter ($\theta_1 = \mu$) is the only distribution parameter to be regressed ($K = 1$) and the response variable is from the exponential family, (1) reduces to the conventional GAM. In addition, $f_j$ can depend on more than one variable (interaction), in which case $X_{\cdot j}$ would be e.g. an $n \times 2$ matrix, but for simplicity we ignore this case in the notation.

A penalized likelihood approach can be used to estimate the unknown quantities; for more details, see Rigby and Stasinopoulos (2005). However, this approach does not allow parameter estimation in the case of more explanatory variables than observations, and variable selection for high-dimensional data is not possible. To deal with these problems, Mayr et al. (2012) proposed a boosted GAMLSS algorithm, which estimates the predictors in GAMLSS with componentwise gradient boosting [Bühlmann and Yu (2003)]. As this method in general updates only one variable in each iteration, it can deal with data that has more variables than observations, and the important variables can be selected by controlling the stopping iterations.

To estimate the unknown predictor parameters $\beta_{j,\theta_k}$, $j \in \{1, \cdots, J\}$ in equation (1), the componentwise gradient boosting algorithm minimizes the empirical risk $R$, which is the loss $\rho$ summed over all observations,

$R = \sum_{i=1}^{n} \rho(y_i, \eta(X_{i\cdot})),$

where the loss $\rho$ measures the discrepancy between the response $y_i$ and the predictor $\eta(X_{i\cdot})$. The predictor $\eta(X_{i\cdot}) = (\eta_{\theta_1}(X_{i\cdot}), \cdots, \eta_{\theta_K}(X_{i\cdot}))$ is a vector of length $K$. For the $i$-th observation $X_{i\cdot}$, each predictor $\eta_{\theta_k}(X_{i\cdot})$ is a single value corresponding to the $i$-th entry of $\eta_{\theta_k}$ in equation (1). The loss function $\rho$ usually used in GAMLSS is the negative log-likelihood of the assumed distribution of $y$ [Thomas et al. (2018); Friedman et al. (2000)].
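For concreteness in the Gaussian case used throughout the paper, the loss and its negative partial derivatives can be written down directly. The following Python sketch is our own illustration (not code from the paper); it assumes the identity link for $\mu$ and the log link for $\sigma$, and all function names are ours.

import numpy as np

def gaussian_loss(y, eta_mu, eta_sigma):
    # negative Gaussian log-likelihood, parametrized on the predictor scale
    sigma2 = np.exp(2.0 * eta_sigma)                 # sigma_i^2 = exp(eta_sigma_i)^2
    return 0.5 * np.log(2.0 * np.pi) + eta_sigma + (y - eta_mu) ** 2 / (2.0 * sigma2)

def negative_gradients(y, eta_mu, eta_sigma):
    # pseudo-residuals u_mu and u_sigma, i.e. -d rho / d eta, elementwise
    sigma2 = np.exp(2.0 * eta_sigma)
    u_mu = (y - eta_mu) / sigma2                     # shrinks as sigma_i^2 grows
    u_sigma = (y - eta_mu) ** 2 / sigma2 - 1.0
    return u_mu, u_sigma

The factor $1/\sigma_i^2$ in $u_\mu$ is exactly the scaling effect discussed in the introduction: a large variance makes the negative gradient for $\mu$, and hence any update with a fixed small step-length, very small.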
The main idea of gradient boosting is to fit simple regression base-learners $h_j(\cdot)$ to the pseudo-residual vector $u^T = (u_1, \cdots, u_n)$, which is defined as the negative partial derivative of the loss $\rho$, i.e.

$u^{[m]}_k = \left( -\frac{\partial}{\partial \eta_{\theta_k}} \rho(y, \eta) \Big|_{\eta = \hat\eta^{[m-1]}(X_{i\cdot}),\, y = y_i} \right)_{i = 1, \cdots, n},$

where $m$ denotes the current boosting iteration. In a componentwise gradient boosting iteration, each base-learner usually involves one explanatory variable (interactions are also allowed) and is fitted separately to $u^{[m]}_k$,

$u^{[m]}_k \;\overset{\text{base-learner}}{\longrightarrow}\; \hat h^{[m]}_{j,\theta_k}(X_{\cdot j}) \quad \text{for } j = 1, \cdots, J.$

For linear base-learners, the correspondence to the model terms in (1) is $\hat h_{j,\theta_k}(X_{\cdot j}) = X_{\cdot j} \hat\beta_j$, where the estimated coefficients can be obtained with the maximum likelihood or least squares method. The best-fitting base-learner is selected based on the residual sum of squares, i.e.

$j^* = \underset{j \in \{1, \cdots, J\}}{\arg\min} \sum_{i=1}^{n} \left( u_{k,i} - \hat h_j(X_{ij}) \right)^2,$

thereby allowing for easy interpretability of the estimated model and also the use of hypothesis tests for single base-learners [Hepp et al. (2019)]. The additive predictor is updated based on the best-fitting base-learner $\hat h_{j^*,\theta_{k^*}}(X_{\cdot j^*})$ of the best-performing sub-model $\eta_{\theta_{k^*}}$,

$\hat\eta^{[m]}_{\theta_{k^*}}(X) = \hat\eta^{[m-1]}_{\theta_{k^*}}(X) + \nu\, \hat h_{j^*,\theta_{k^*}}(X_{\cdot j^*}),$   (2)

where $\nu$ denotes the step-length. In order to prevent overfitting, the step-length is usually set to a small value, in most cases 0.1. Equation (2) updates only the best-performing predictor $\hat\eta^{[m]}_{\theta_{k^*}}$; all other predictors (i.e. for $k \neq k^*$) remain the same as in the previous boosting iteration. The best-performing sub-model $\theta_{k^*}$ can be selected by comparing the empirical risks, i.e. checking which model parameter achieves the largest model improvement.

The main tuning parameter in this procedure, as in other boosting algorithms, is the number of iterations to perform before stopping, denoted by $m_{\text{stop},\theta}$. As a too large or too small $m_{\text{stop},\theta}$ leads to an over- or underfitted model, cross-validation [Kohavi et al. (1995)] is one of the most widely used methods to find the optimal $m_{\text{stop},\theta}$.

The boosted GAMLSS can deal with data that has more variables than observations, as componentwise gradient boosting updates only one variable in each iteration. It leads to variable selection if some less important variables are never selected as the best-performing variable and are thus not included in the final model for a given stopping iteration $m_{\text{stop},\theta}$.

2.2 Cyclical boosted GAMLSS

The original framework of boosted GAMLSS proposed by Mayr et al. (2012) is a cyclical approach, which means that every predictor $\eta_{\theta_k}$, $k \in \{1, \cdots, K\}$ is updated in a cyclical manner inside each boosting iteration. The iteration starts by updating the predictor for the location parameter, using the predictors from the previous iteration for all other parameters. Then, the updated location model is used for updating the scale model, and so on. A schematic overview of the updating process in iteration $m + 1$ for $K = 4$ is

$(\hat\mu^{[m]}, \hat\sigma^{[m]}, \hat\nu^{[m]}, \hat\tau^{[m]}) \overset{\text{update}}{\longrightarrow} \hat\eta^{[m+1]}_{\mu} \rightarrow \hat\mu^{[m+1]}$
$(\hat\mu^{[m+1]}, \hat\sigma^{[m]}, \hat\nu^{[m]}, \hat\tau^{[m]}) \overset{\text{update}}{\longrightarrow} \hat\eta^{[m+1]}_{\sigma} \rightarrow \hat\sigma^{[m+1]}$
$(\hat\mu^{[m+1]}, \hat\sigma^{[m+1]}, \hat\nu^{[m]}, \hat\tau^{[m]}) \overset{\text{update}}{\longrightarrow} \hat\eta^{[m+1]}_{\nu} \rightarrow \hat\nu^{[m+1]}$
$(\hat\mu^{[m+1]}, \hat\sigma^{[m+1]}, \hat\nu^{[m+1]}, \hat\tau^{[m]}) \overset{\text{update}}{\longrightarrow} \hat\eta^{[m+1]}_{\tau} \rightarrow \hat\tau^{[m+1]}.$
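The selection and update step described above is easy to state in code. The sketch below is our own minimal illustration with least-squares linear base-learners (all names are ours, continuing the previous sketch); it performs one componentwise update of a single predictor as in equation (2) and is the building block of both the cyclical scheme above and the non-cyclical algorithm below.

import numpy as np

def fit_linear_baselearner(x_j, u):
    # least-squares fit of the pseudo-residuals u on one covariate (with intercept)
    X = np.column_stack([np.ones_like(x_j), x_j])
    beta, *_ = np.linalg.lstsq(X, u, rcond=None)
    return X @ beta

def componentwise_update(X, u, eta, nu=0.1):
    # fit every base-learner to u, pick the best by RSS, and update eta as in (2)
    fits = [fit_linear_baselearner(X[:, j], u) for j in range(X.shape[1])]
    rss = [np.sum((u - h) ** 2) for h in fits]
    j_star = int(np.argmin(rss))
    return eta + nu * fits[j_star], j_star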
However, not all distribution parameters have the same complexity, i.e. the stopping iterations $m_{\text{stop},\theta}$ should be set separately for the different parameters, or optimized jointly, for example by grid search. Since grid search scales exponentially with the number of distribution parameters, such optimization can be very slow.

2.3 Non-cyclical boosted GAMLSS

In order to deal with the issues of the cyclical approach, Thomas et al. (2018) proposed a non-cyclical version that updates only one distribution parameter per boosting iteration, instead of successively updating all parameters, by comparing the model improvement (negative log-likelihood) of each model parameter, see Algorithm 1 (especially step 11). Consequently, instead of specifying separate stopping iterations $m_{\text{stop},\theta}$ for the different parameters and tuning them with the computationally demanding grid search, only one overall stopping iteration, denoted by $m_{\text{stop}}$, needs to be tuned, e.g. with a line search [Friedman (2001); Brent (2013)]. The tuning problem thus reduces from a multi-dimensional to a one-dimensional problem, which vastly reduces the computing time.
Algorithm 1 Non-cyclical componentwise gradient boosting in multiple dimensions – basic algorithm

1. Initialize the additive predictors $\hat\eta^{[0]} = \left(\hat\eta^{[0]}_{\theta_1}, \cdots, \hat\eta^{[0]}_{\theta_K}\right)$ with offset values.
2. For each distribution parameter $\theta_k$, $k = 1, \cdots, K$, specify a set of base-learners, i.e. for parameter $\theta_k$ define $h_{1,\theta_k}(\cdot), \cdots, h_{J_k,\theta_k}(\cdot)$, where $J_k$ is the cardinality of the set of base-learners specified for $\theta_k$.
3. for $m = 1$ to $m_{\text{stop}}$ do
4.   for $k = 1$ to $K$ do
5.     Compute the negative partial derivatives $-\frac{\partial}{\partial \eta_{\theta_k}} \rho(y, \eta)$ and plug in the current estimates $\hat\eta^{[m-1]}(\cdot)$:
       $u^{[m]}_k = \left( -\frac{\partial}{\partial \eta_{\theta_k}} \rho(y, \eta) \Big|_{\eta = \hat\eta^{[m-1]}(X_{i\cdot}),\, y = y_i} \right)_{i = 1, \cdots, n}.$
6.     Fit (e.g. with the least squares method) the negative gradient vector $u^{[m]}_k$ separately to every base-learner: $u^{[m]}_k \overset{\text{base-learner}}{\longrightarrow} \hat h_{j,\theta_k}(X_{\cdot j})$ for $j = 1, \cdots, J_k$.
7.     Select the best-fitting base-learner $\hat h_{j^*,\theta_k}(X_{\cdot j^*})$ by the inner loss, i.e. the residual sum of squares of the base-learner fit w.r.t. $u^{[m]}_k = \left(u^{[m]}_{k,1}, \cdots, u^{[m]}_{k,n}\right)^T$:
       $j^* = \underset{j \in \{1, \cdots, J_k\}}{\arg\min} \sum_{i=1}^{n} \left( u^{[m]}_{k,i} - \hat h_{j,\theta_k}(X_{ij}) \right)^2,$
       where we drop the dependence of $j^*$ on $k$ in the notation for simplicity.
8.     Set the step-length to a fixed value $\nu$, usually $\nu = 0.1$: $\nu^{[m]}_{\theta_k} = \nu$.
9.     Compute the possible improvement of this update regarding the outer loss,
       $\Delta\rho_k = \sum_{i=1}^{n} \rho\left(y_i,\, \hat\eta^{[m-1]}_{\theta_k}(X_{i\cdot}) + \nu^{[m]}_{\theta_k} \cdot \hat h_{j^*,\theta_k}(X_{ij^*})\right).$
10.  end for
11.  Update, depending on the value of the loss reduction, only the overall best-fitting base-learner, $k^* = \arg\min_{k \in \{1, \cdots, K\}} \Delta\rho_k$:
       $\hat\eta^{[m]}_{\theta_{k^*}}(X) = \hat\eta^{[m-1]}_{\theta_{k^*}}(X) + \nu^{[m]}_{\theta_{k^*}} \cdot \hat h_{j^*,\theta_{k^*}}(X_{\cdot j^*}).$
12.  Set $\hat\eta^{[m]}_{\theta_k} := \hat\eta^{[m-1]}_{\theta_k}$ for all $k \neq k^*$.
13. end for
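The control flow of Algorithm 1 for the Gaussian case can be sketched in a few lines of Python (again our own illustration, reusing the helper functions from the previous sketches; the offsets and base-learners are deliberately simplistic):

def noncyclical_boosting(X, y, m_stop, nu=0.1):
    # minimal non-cyclical loop: mu and sigma compete via the outer loss (step 11)
    eta_mu = np.full(y.shape, np.mean(y))             # offset for mu
    eta_sigma = np.full(y.shape, np.log(np.std(y)))   # offset for log(sigma)
    for m in range(m_stop):
        u_mu, u_sigma = negative_gradients(y, eta_mu, eta_sigma)
        cand_mu, _ = componentwise_update(X, u_mu, eta_mu, nu)           # steps 5-9 for mu
        cand_sigma, _ = componentwise_update(X, u_sigma, eta_sigma, nu)  # and for sigma
        risk_mu = gaussian_loss(y, cand_mu, eta_sigma).sum()
        risk_sigma = gaussian_loss(y, eta_mu, cand_sigma).sum()
        if risk_mu <= risk_sigma:                     # keep only the best parameter
            eta_mu = cand_mu
        else:
            eta_sigma = cand_sigma
    return eta_mu, eta_sigma

With the fixed nu this reproduces the FSL behaviour discussed below; replacing the fixed nu by step 8 of Algorithm 2 yields the adaptive variants.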
The cyclical approach led to an inherent but somewhat artificial balance between the distribution parameters, as the predictors for all distribution parameters are updated in each iteration. The different final stopping values $m_{\text{stop},\theta}$ for the different distribution parameters - chosen by tuning methods such as cross-validation - allow updates to stop at different times for distribution parameters of different complexity, to avoid overfitting. In the non-cyclical algorithm, especially when $m_{\text{stop}}$ is not large enough, there is the danger of an imbalance between the predictors. If the selection between the predictors to update is not fair, this can lead to iterations primarily updating some of the predictors and underfitting others. We will provide a detailed example for the Gaussian distribution with large $\sigma$ in Section 4.2.

A related challenge is to choose an appropriate step-length $\nu^{[m]}_{\theta_k}$, for both the cyclical and the non-cyclical approach. Tuning the parameters when boosting GAMLSS models relies mainly on the number of boosting iterations ($m_{\text{stop}}$), with the step-length $\nu$ usually set to a small value such as 0.1. Bühlmann and Hothorn (2007) argued that using a small step-length like 0.1 (potentially resulting in a larger number of iterations $m_{\text{stop}}$) has a similar computing speed as using an adaptive step-length determined by a line search, but means an easier tuning task for one parameter ($m_{\text{stop}}$) instead of two. However, this result referred to models with a single predictor. A fixed step-length can lead to an imbalance in the case of several predictors that may live on quite different scales; for example, 0.1 may be too small for $\mu$ but large for $\sigma$. We will discuss such cases analytically and with empirical evidence in the later sections. Moreover, varying the step-lengths for the different sub-models directly influences the choice of the best-performing sub-model in the non-cyclical boosting algorithm, so choosing a subjective step-length is not appropriate. In the following, we denote a fixed predefined step-length such as 0.1 as the fixed step-length (FSL) approach.

3 Adaptive step-length

To overcome the problems stated above, we propose using an adaptive step-length (ASL) while boosting. In particular, we propose to optimize the step-length for each predictor in each iteration to obtain a fair comparison between the predictors. While the adaptive step-length has been used before, the proposal to use different ASLs for different predictors is new, and we will see that it leads to balanced updates of the different predictors. In this section, we first introduce the general idea of the implementation of adaptive step-lengths for the different predictors in GAMLSS. For the important special case of a Gaussian distribution with two model parameters ($\mu$ and $\sigma$), we then derive and discuss the adaptive step-lengths and their properties, which also serves as an important illustration of the relevant issues more generally.

3.1 Adaptive step-lengths for different predictors

Unlike the step-length in equation (2) and Algorithm 1, step 11, the adaptive step-length may also vary across boosting iterations, according to the loss reduction. The adaptive step-length can be derived by solving the optimization problem

$\nu^{*[m]}_{j^*,\theta_k} = \underset{\nu}{\arg\min} \sum_{i=1}^{n} \rho\left(y_i,\, \hat\eta^{[m-1]}_{\theta_k}(X_{i\cdot}) + \nu \cdot \hat h_{j^*,\theta_k}(X_{ij^*})\right),$   (3)

where $\nu^{*[m]}_{j^*,\theta_k}$ is the optimal step-length of the model parameter $\theta_k$, dependent on $j^*$, in iteration $m$. The optimal step-length is the value that leads to the largest possible decrease of the empirical risk, and it usually leads to overfitting of the corresponding variable if no shrinkage is used [Hepp et al. (2016)]. Therefore, the actual adaptive step-length (ASL) we apply in the boosting algorithm is the product of two parts, the shrinkage parameter $\lambda$ and the optimal step-length $\nu^{*[m]}_{j^*,\theta_k}$, i.e.

$\nu^{[m]}_{j^*,\theta_k} = \lambda \cdot \nu^{*[m]}_{j^*,\theta_k}.$

In this article, we take $\lambda = 0.1$, thus 10% of the optimal step-length. By comparison, the fixed step-length $\nu = 0.1$ would correspond to a combination of the shrinkage parameter $\lambda = 0.1$ with the "optimal" step-length $\nu^*$ set to one.

The non-cyclical algorithm is extended to the ASL by replacing the fixed step-length in step 8 of Algorithm 1 with the adaptive one. We formulate this change in Algorithm 2. As the parameters in GAMLSS may have quite different scales, updates with a fixed step-length can lead to an imbalance between the model parameters, especially when $m_{\text{stop}}$ is not large enough.
Algorithm 2 Non-cyclical componentwise gradient boosting with adaptive step-length – extension of the basic Algorithm 1
Steps 1–7 equal to Algorithm 1; in addition, choose the shrinkage parameter $\lambda$.
8. Find the optimal step-length by optimizing the outer loss,
   $\nu^{*[m]}_{j^*,\theta_k} = \underset{\nu}{\arg\min} \sum_{i=1}^{n} \rho\left(y_i,\, \hat\eta^{[m-1]}_{\theta_k}(X_{i\cdot}) + \nu \cdot \hat h_{j^*,\theta_k}(X_{ij^*})\right),$
   and set the adaptive step-length $\nu^{[m]}_{j^*,\theta_k}$ to the optimal value with shrinkage $\lambda$: $\nu^{[m]}_{j^*,\theta_k} = \lambda \cdot \nu^{*[m]}_{j^*,\theta_k}$.
Steps 9–13 equal to those in Algorithm 1.

When using FSL, a single update for predictor $\eta_{\theta_1}$ may achieve the same amount of global loss reduction as several updates of another predictor $\eta_{\theta_2}$, even if the actually possible contributions of the competing base-learners are similar, because for different scales the loss reductions of $\eta_{\theta_2}$ in these iterations are always smaller than those of $\eta_{\theta_1}$. Such unfair selections can be avoided by using the ASL, because the model improvement then depends on the largest possible decrease for each predictor, i.e. the potential reductions in the empirical risks of all predictors are on the same level, and their comparison is therefore fair. Fair selection does not enforce an equal number of updates as in the cyclical approach. The ASL approach can still lead to imbalanced updates of the predictors, but such imbalance actually reveals the intrinsically different complexities of the sub-models.

The main contribution of this paper is the proposal to use separate ASLs for each predictor in GAMLSS. This idea can also be applied to other complex models with several predictors for the different parameters (e.g. zero-inflated count models), because these models face the same problem, i.e. the scales of the parameters might differ considerably. If a boosting algorithm is preferred for the estimation of such a model, we provide a new solution to address these kinds of problems, i.e. separate adaptive step-lengths for each distribution parameter.

3.2 Gaussian distribution specification

In general, the adaptive step-length $\nu$ can be found by optimization procedures such as a line search. However, such methods do not help to reveal the properties of the adaptive step-length and its relationship with the model parameters. Moreover, a line search looks for the optimal value in a predefined search interval, which can be difficult to specify, since too narrow intervals might not include the optimal value and too large intervals increase the searching time. Direct computation from an analytical expression is faster than a search. By investigating the important special case of a Gaussian distribution with two parameters, we can learn a lot about the adaptive step-length in the general case.

Consider the data points $(y_i, x_{i\cdot})$, $i \in \{1, \cdots, n\}$, where $x$ is an $n \times J$ matrix. Assume the true data-generating mechanism is the normal model

$y_i \sim N(\mu_i, \sigma_i^2), \quad \mu_i = \eta_\mu(x_{i\cdot}), \quad \sigma_i = \exp(\eta_\sigma(x_{i\cdot})).$

As we talk about the observed data, we replace $\eta_{\theta_k}$, where $k = 1, 2$ for the Gaussian distribution, with $\mu$ and $\sigma$, and replace $X$ with $x$. The identity and exponential functions for $\mu$ and $\sigma$ are thus the corresponding inverse links. Taking the negative log-likelihood as the loss function, its negative partial derivatives $u_\mu$ and $u_\sigma$ in iteration $m$ for the two parameters can then be modelled with the base-learners $\hat h^{[m]}_{j,\mu}$ and $\hat h^{[m]}_{j,\sigma}$. The optimization process can then be divided into two parts: the ASL for the location parameter $\mu$, and the ASL for the scale parameter $\sigma$. As the ASL simply shrinks the optimal value, we consider only the optimal step-lengths for both parameters in the following.
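Such a line search for step 8 of Algorithm 2 is a one-dimensional bounded optimization. A minimal sketch using SciPy (our own illustration; the interval (0, 10) mirrors the default search interval discussed in Section 5, and risk is assumed to be a callable in $\nu$ that sums the outer loss for the candidate update):

from scipy.optimize import minimize_scalar

def adaptive_step_length(risk, lam=0.1, interval=(0.0, 10.0)):
    # numerically solve equation (3) for one predictor, then shrink by lambda
    nu_opt = minimize_scalar(risk, bounds=interval, method="bounded").x
    return lam * nu_opt

For the Gaussian $\mu$-update, risk would be, e.g., lambda nu: gaussian_loss(y, eta_mu + nu * h_star, eta_sigma).sum(), with all other predictors held fixed.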
3.2.1 µ

The analytical optimal step-length for $\mu$ in iteration $m$ is obtained by minimizing the empirical risk,

$\nu^{*[m]}_{j^*,\mu} = \underset{\nu}{\arg\min} \sum_{i=1}^{n} \rho\left(y_i, \{\hat\eta^{[m]}_\mu(x_{i\cdot}), \hat\eta^{[m-1]}_\sigma(x_{i\cdot})\}\right) = \underset{\nu}{\arg\min} \sum_{i=1}^{n} \frac{\left(y_i - \hat\eta^{[m-1]}_\mu(x_{i\cdot}) - \nu\, \hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2}{2\,\hat\sigma^{2[m-1]}_i},$   (4)

where the expression $\hat\sigma^{2[m-1]}_i$ represents the square of the standard deviation from the previous iteration, i.e. $\hat\sigma^{2[m-1]}_i = (\hat\sigma^{[m-1]}_i)^2$. The optimal value $\nu^{*[m]}_{j^*,\mu}$ is obtained by setting the derivative of this expression to zero, which yields the analytical ASL for $\mu$ (for more derivation details, see Appendix A.1):

$\nu^{*[m]}_{j^*,\mu} = \frac{\sum_{i=1}^{n} \left(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2}{\sum_{i=1}^{n} \left(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2 / \hat\sigma^{2[m-1]}_i}.$   (5)

Obviously, $\nu^{*[m]}_{j^*,\mu}$ is not an independent parameter in GAMLSS but depends on the base-learner fit $\hat h^{[m]}_{j^*,\mu}(x_{ij^*})$ with respect to the best-performing variable $x_{\cdot j^*}$ and on the estimated variances $\hat\sigma^{2[m-1]}_i$ from the previous iteration.

In the special case of a Gaussian additive model, the scale parameter $\sigma$ is assumed to be constant, i.e. $\hat\sigma^{2[m-1]}_i = \hat\sigma^{2[m-1]}$ for all $i \in \{1, \cdots, n\}$. We then obtain

$\nu^{*[m]}_{j^*,\mu} = \frac{\sum_{i=1}^{n} \left(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2 \hat\sigma^{2[m-1]}}{\sum_{i=1}^{n} \left(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2} = \hat\sigma^{2[m-1]}.$   (6)

This reveals an interesting property of the optimal step-length or ASL: the analytical ASL for $\mu$ in the Gaussian distribution is actually the variance (as computed in the previous boosting iteration). This property makes the results here applicable not only to the special GAMLSS case, but also to the boosting of additive models with normal responses. In the case of Gaussian additive models, we can therefore use $\nu^{[m]}_{j^*,\mu} = \lambda \hat\sigma^{2[m-1]}$ as the step-length, which has a stronger theoretical foundation than the common choice 0.1.

Back in the general GAMLSS case, we can further investigate the behavior of the step-length by considering the limiting case $m \to \infty$. For large $m$, all base-learner fits $\hat h^{[m]}_{j^*,\mu}(x_{ij^*})$ converge to zero or are similarly small. If we consequently approximate all $\hat h^{[m]}_{j^*,\mu}(x_{ij^*})$ by some small constant $h$, this gives an approximation of the analytical optimal step-length of

$\nu^{*[m]}_{j^*,\mu} \approx \frac{\sum_{i=1}^{n} h^2}{\sum_{i=1}^{n} h^2 / \hat\sigma^{2[m-1]}_i} = \frac{n h^2}{h^2 \sum_{i=1}^{n} \frac{1}{\hat\sigma^{2[m-1]}_i}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\hat\sigma^{2[m-1]}_i}},$   (7)

which is the harmonic mean of the estimated variances $\hat\sigma^{2[m-1]}_i$ from the previous iteration. While this expression requires $m$ to be large, which may not be reached if $m_{\text{stop}}$ is of moderate size to prevent overfitting, it still indicates the strong dependence of the optimal step-length on the variances $\hat\sigma^{2[m-1]}_i$, and it generalizes the optimal value of the additive model in (6).
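Equation (5) is cheap to evaluate directly; a small sketch of our own, including a check of the additive-model special case (6):

def optimal_step_mu(h_star, sigma2_prev):
    # analytical optimal step-length for mu, equation (5)
    return np.sum(h_star ** 2) / np.sum(h_star ** 2 / sigma2_prev)

# special case (6): constant variance -> the optimal step-length equals that variance
h = np.array([0.3, -0.1, 0.25, 0.05])
assert np.isclose(optimal_step_mu(h, np.full(4, 2.5)), 2.5)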
3.2.2 σ

The optimal step-length for the scale parameter $\sigma$ can be obtained analogously by minimizing the empirical risk, now with respect to $\nu^{*[m]}_{j^*,\sigma}$. We obtain

$\nu^{*[m]}_{j^*,\sigma} = \underset{\nu}{\arg\min} \sum_{i=1}^{n} \rho\left(y_i, \{\hat\eta^{[m-1]}_\mu(x_{i\cdot}), \hat\eta^{[m]}_\sigma(x_{i\cdot})\}\right) = \underset{\nu}{\arg\min} \sum_{i=1}^{n} \left(\hat\eta^{[m-1]}_\sigma(x_{i\cdot}) + \nu\, \hat h^{[m]}_\sigma(x_{ij^*})\right) + \frac{1}{2} \sum_{i=1}^{n} \left(y_i - \hat\eta^{[m-1]}_\mu(x_{i\cdot})\right)^2 \exp\left(-2\left(\hat\eta^{[m-1]}_\sigma(x_{i\cdot}) + \nu\, \hat h^{[m]}_\sigma(x_{ij^*})\right)\right).$   (8)

After checking the positivity of the second-order derivative of the expression in equation (8), the optimal value can be obtained by setting the first-order derivative to zero:

$\sum_{i=1}^{n} \hat h^{[m]}_\sigma(x_{ij^*}) - \sum_{i=1}^{n} \left(\hat h^{[m]}_\sigma(x_{ij^*}) + \epsilon_{i,\sigma} + 1\right) \hat h^{[m]}_\sigma(x_{ij^*}) \exp\left(-2\nu^{*[m]}_{j^*,\sigma}\, \hat h^{[m]}_\sigma(x_{ij^*})\right) \overset{!}{=} 0,$   (9)

where $\epsilon_{i,\sigma}$ denotes the residuals of regressing the negative partial derivatives $u^{[m]}_{\sigma,i}$ on the base-learner $\hat h^{[m]}_\sigma(x_{ij^*})$, i.e. $u_{\sigma,i} = \hat h^{[m]}_\sigma(x_{ij^*}) + \epsilon_{i,\sigma}$. Unfortunately, equation (9) cannot be simplified further, which means that there is no analytical ASL for the scale parameter $\sigma$ in the Gaussian distribution. Hence, the optimal ASL must be found by performing a conventional line search. For more details, see also Appendix A.2.

Even without an analytical solution, we can still use (9) to study the behavior of the ASL further. Analogous to the derivation of (7), $\hat h^{[m]}_\sigma(x_{ij^*})$ converges to zero for $m \to \infty$, so we approximate it with a (small) constant, $\hat h^{[m]}_\sigma(x_{ij^*}) \approx h$ for all $i \in \{1, \cdots, n\}$. Then (9) simplifies to

$\sum_{i=1}^{n} h - \sum_{i=1}^{n} (h + \epsilon_{i,\sigma} + 1)\, h \exp\left(-2\nu^{*[m]}_{j^*,\sigma} h\right) = 0 \;\Leftrightarrow\; \nu^{*[m]}_{j^*,\sigma} = \frac{1}{2h} \log\left(h + 1 + \frac{1}{n} \sum_{i=1}^{n} \epsilon_{i,\sigma}\right) \;\Leftrightarrow\; \nu^{*[m]}_{j^*,\sigma} = \frac{1}{2h} \log(h + 1),$   (10)

where $\frac{1}{n} \sum_{i=1}^{n} \epsilon_{i,\sigma} = 0$ in the regression model. Equation (10) can be simplified further by approximating the logarithm with a Taylor series at $h = 0$, thus

$\nu^{*[m]}_{j^*,\sigma} = \frac{1}{2h} \left(h - \frac{h^2}{2} + O(h^3)\right) = \frac{1}{2} - \frac{h}{4} + O(h^2).$

As $h \to 0$ for $m \to \infty$, the limit of this approximate optimal step-length for $\sigma$ is

$\lim_{m \to \infty} \nu^{*[m]}_{j^*,\sigma} = \lim_{h \to 0} \left(\frac{1}{2} - \frac{h}{4}\right) = \frac{1}{2}.$   (11)

Thus, the ASL for $\sigma$ approaches approximately 0.05 if we take the shrinkage parameter $\lambda = 0.1$ and the iterations run for a longer time (and the boosting algorithm is not stopped too early, to prevent overfitting, for this trend to show).
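The limit in (11) is easy to verify numerically from the simplified form (10), $\nu^* = \log(1+h)/(2h)$ (a quick check of our own):

for h in [0.5, 0.1, 0.01, 0.001]:
    print(h, np.log1p(h) / (2 * h))
# 0.5    0.4055
# 0.1    0.4766
# 0.01   0.4975
# 0.001  0.4998   -> approaches 1/2, i.e. an ASL of about 0.05 with lambda = 0.1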
3.3 (Semi-)Analytical adaptive step-length

Knowing the properties of the analytical ASL in boosting GAMLSS for the Gaussian distribution, we can replace the line search by the analytical solution for the location parameter $\mu$. If we keep the line search for the scale parameter $\sigma$, we call this the semi-analytical adaptive step-length (SAASL). Moreover, we are interested in the performance of combining the analytical ASL for $\mu$ with the approximate value $0.05 = \lambda \cdot 0.5$ (with $\lambda = 0.1$) for the ASL for $\sigma$, which is motivated by the limiting considerations discussed above and has a better theoretical foundation than selecting an arbitrary small value as in the common FSL. We call this step-length setup SAASL05. In either of these cases, it is straightforward and computationally efficient to obtain the (approximate) optimal value(s), and both alternatives are faster than performing two line searches.

The semi-analytical solution avoids the need to select a search interval for the line search, at least for the ASL for $\mu$ in the case of SAASL, and for both parameters for SAASL05. This is an advantage, since too large search intervals cause additional computing time, but too small intervals may miss the optimal ASL value and again lead to an imbalance of updates between the parameters. Also note that the value 0.5 gives an indication of a reasonable range of the search interval for $\nu^{*[m]}_{j^*,\sigma}$ if a line search is conducted after all.

The boosted GAMLSS algorithm with ASL for the Gaussian distribution is shown in Algorithm 3.
Algorithm 3 Non-cyclical componentwise gradient boosting for the Gaussian distribution with different step-lengths – extension of the basic Algorithm 1
Steps 1–7 equal to Algorithm 1; in addition, choose the shrinkage parameter $\lambda$.
8. Set or find the step-length $\nu^{[m]}_{j^*,\theta_k}$ for $\theta_k \in \{\mu, \sigma\}$ by one of the following:
   • Adaptive step-length (ASL):
     $\nu^{*[m]}_{j^*,\theta_k} = \underset{\nu}{\arg\min} \sum_{i=1}^{n} \rho\left(y_i,\, \hat\eta^{[m-1]}_{\theta_k}(x_{i\cdot}) + \nu \cdot \hat h_{j^*,\theta_k}(x_{ij^*})\right);$
   • Semi-analytical adaptive step-length (SAASL):
     if $\theta_k = \mu$: $\nu^{*[m]}_{j^*,\mu} = \frac{\sum_{i=1}^{n} \left(\hat h_{j^*,\mu}(x_{ij^*})\right)^2}{\sum_{i=1}^{n} \left(\hat h_{j^*,\mu}(x_{ij^*})\right)^2 / \hat\sigma^{2[m-1]}_i}$;
     if $\theta_k = \sigma$: same as for ASL;
   • Semi-analytical adaptive step-length (SAASL05):
     if $\theta_k = \mu$: same as for SAASL;
     if $\theta_k = \sigma$: $\nu^{*[m]}_{j^*,\sigma} = 0.5$.
   Set the adaptive step-length $\nu^{[m]}_{j^*,\theta_k}$ to the optimal value with shrinkage $\lambda$: $\nu^{[m]}_{j^*,\theta_k} = \lambda \cdot \nu^{*[m]}_{j^*,\theta_k}$.
Steps 9–13 equal to those in Algorithm 1.
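Step 8 of Algorithm 3 then amounts to a small dispatch between the three variants. A sketch of our own (reusing optimal_step_mu and the SciPy line search from the earlier sketches; risk is again a callable in $\nu$):

def step_length_gaussian(param, variant, h_star, sigma2_prev, risk=None, lam=0.1):
    # step 8 of Algorithm 3 for a Gaussian GAMLSS: ASL, SAASL or SAASL05
    if param == "mu" and variant in ("SAASL", "SAASL05"):
        nu_opt = optimal_step_mu(h_star, sigma2_prev)   # analytical solution (5)
    elif param == "sigma" and variant == "SAASL05":
        nu_opt = 0.5                                    # approximate limit (11)
    else:                                               # ASL, or SAASL for sigma
        nu_opt = minimize_scalar(risk, bounds=(0.0, 10.0), method="bounded").x
    return lam * nu_opt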
For a chosen shrinkage parameter of $\lambda = 0.1$, the $\nu_\sigma$ in SAASL05 would be 0.05, which is a smaller or "less aggressive" value than the 0.1 of FSL, leading to a somewhat larger number of boosting iterations but a smaller risk of overfitting, and to a better balance with the ASL for $\mu$.

4 Simulation study

In the following, two simulations are shown to demonstrate the performance of the adaptive algorithms. The first one compares the estimation accuracy of the different non-cyclical boosted GAMLSS algorithms with FSL or ASL in a Gaussian regression model for location and scale. The second one underlines the problem of FSL and the performance of the adaptive approaches if the variance in this setting is large.

4.1 Gaussian Location and Scale Model
The simulation study in Thomas et al. (2018) showed that their FSL non-cyclical approach outperforms the classical cyclical approach. We use the same setup to show that the ASL approach performs at least as well as the FSL non-cyclical approach (and hence also outperforms the classical cyclical approach). At the end of this subsection we show that the good performance of FSL is due to the chosen simulated data structure. The setup is the following: the response $y_i$ is drawn from $N(\mu_i, \sigma_i^2)$ for $n = 500$ observations, with six informative covariates $x_{ij}$, $j \in \{1, \cdots, 6\}$ drawn independently from $\text{Uni}(-1, 1)$. The predictors of both distribution parameters are

$\eta_\mu(x_{i\cdot}) = \mu_i = x_{i1} + 2x_{i2} + 0.5x_{i3} - x_{i4},$
$\eta_\sigma(x_{i\cdot}) = \log(\sigma_i) = 0.5x_{i3} + 0.25x_{i4} - 0.25x_{i5} - 0.5x_{i6},$

where $x_3$ and $x_4$ are shared between $\mu$ and $\sigma$. Moreover, four settings with an increasing number $p_{\text{n-inf}}$ of additional non-informative variables (starting from $p_{\text{n-inf}} = 0$), sampled from $\text{Uni}(-1, 1)$, are considered. We conduct $B = 100$ simulation runs. The estimated coefficients of $\eta_\mu$ and $\eta_\sigma$, taken at the stopping iterations tuned by 10-fold CV with the maximum number of boosting iterations set to 1000, are shown in Appendix Figures B.1 and B.2. Overall, the estimated coefficients are similar across all four methods, with the shrinkage bias of boosting only becoming apparent with an increasing number of noise variables.
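For reference, the data-generating process of this first simulation, as reconstructed above, can be sketched as follows (our own illustration; the coefficient values are those stated in the setup):

def simulate_gaussian_lss(n=500, p_noise=0, seed=1):
    # one dataset from the location-scale design of Section 4.1
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 6 + p_noise))
    eta_mu = X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2] - X[:, 3]
    eta_sigma = 0.5 * X[:, 2] + 0.25 * X[:, 3] - 0.25 * X[:, 4] - 0.5 * X[:, 5]
    y = rng.normal(eta_mu, np.exp(eta_sigma))
    return X, y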
While it would also be possible to lower the step-length for FSL to reducethe number of non-informative variables included in the final model, this would increase the number ofboosting iterations and the computing time, and it would not address the imbalance between updates for µ and σ . The optimal choice of the step-length is also difficult without further tuning or an automaticselection as in ASL. m s p n - i n f = n - i n f = n - i n f = n - i n f = FSL ASL SAASL SAASL05 FSL ASL SAASL SAASL050.00.51.01.52.001020300102030405001020304050 F a l s e P o s i t i v e Figure 4.2: Comparison between false positives for FSL and the three ASL methods. The left columncomprises the false positives for µ , the right column for σ . The different numbers of non-informativevariables settings are represented row-wise.In Figure 4.3 we show an example of the comparison between the optimal step-lengths in this case.As can be seen, the step-lengths for σ (depicted in grey) converge to . as shown in section 3.2.2. Thesecond fact that becomes obvious when looking at the figure is that the optimal step-lengths for bothpredictors do not differ a lot. Even though differences can be observed in early iterations in particular,12 .2 Large Variance with resulting Imbalance between Location and Scale 4 SIMULATION STUDY the step-lengths still have the same order of magnitude. This is not only the case for this example butoverall in this simulation setup. Having this in mind, the similar results for both approaches (FSL andASL) are not very surprising anymore: there is hardly any difference in the approaches, since the updatesdo not need different step-lengths to be balanced. In the next subsection we will examine a case inwhich the data calls for different step-lengths, and see how both methods perform under those changedcircumstances. Iterations S t ep Leng t h step length n m n s Figure 4.3: Comparison of the optimal step-lengths ν ∗ [ m ] j ∗ , µ and ν ∗ [ m ] j ∗ , σ in SAASL from one of the simulation runs. The step-lengths for µ are in black dots, the step-lengths for σ in grey cross. Differenthorizontal layers of dots/crosses correspond to different covariates. As discussed above, the Gaussian location and scale model in section 4.1 did not lead to a large differencebetween FSL and ASL, as the optimal step-lengths for µ and σ were roughly similar and the imbalancebetween the updates for the two predictors in FSL was thus not large. In this section, we investigate asetting with a large variance, which leads to a stronger imbalance between the two parts of the model.In the following, we use SAASL as a representative of the adaptive approaches in our presentation,as it yields identical results to ASL, but avoids the numerical search for the optimal ν µ by using theanalytical result (5). Since estimated effects generally deviated more strongly from the theoretical valuesthan before due to the large variance (details will be discussed later), we additionally compared the resultsto those obtained using GAMLSS with penalized maximum likelihood estimation as implemented in theR-package gamlss [Rigby and Stasinopoulos (2005)].Consider the data generating mechanism y i ∼ N ( µ i , σ i ) , i ∈ { , · · · , } with B = 100 simulationruns. The predictors are determined by η µ ( x i · ) = µ i = 1 + x i + 2 x i − x i η σ ( x i · ) = log( σ i ) = 5 + 0 . x i − . x i + 0 . x i , where x · j ∼ Uni ( − , , j ∈ { , , , , } , x · and x · are noise variables. 
Note that this choice of $\eta_\sigma$ leads to an extremely large standard deviation on the order of 150 due to the large intercept 5. The stopping iteration is obtained by 10-fold CV, and the maximum number of iterations is 3000 and 2,000,000 for SAASL and FSL, respectively.

As can be seen in Figure 4.4, both fixed and adaptive step-lengths yield reasonable estimates regarding $\eta_\sigma$, but FSL results in many false negative estimates equal to zero for $\eta_\mu$ in the majority of the simulation runs. This is of course connected to the relative importance of the variance component in this setting, which should in itself already lead to a preference for updating $\eta_\sigma$ rather than $\eta_\mu$ in early boosting iterations, due to the fact that the negative gradient for $\mu$ (i.e. $u_{\mu,i} = (y_i - \mu_i)/\sigma_i^2$ with large $\sigma_i^2$) is actually scaled by the variance (recall the large intercept 5, the log-link and the resulting exponential transformation) and hence very small. As a consequence, the impact on the global loss of base-learners fitted to this gradient is also small compared to those suggested for updates regarding $\sigma$ in step 11 of Algorithm 1. Using the same step-length for both parameters then makes it clearly harder to identify informative effects on $\mu$, as they are trivialized in the comparisons.
Figure 4.4: Distribution of coefficient estimates from $B = 100$ simulation runs. The true coefficients are marked by the dashed horizontal lines.

The adaptive step-lengths implemented in SAASL compensate for this disadvantage. Compared to the simulation results in the previous subsection, the estimates regarding $\eta_\mu$ are less precise, with large variability around the true values. This is not a problem of SAASL but again a consequence of the large variance, which obscures the effects on the mean, and it is also encountered with the penalized maximum likelihood approach implemented in the gamlss package (called GAMLSS in Figure 4.4). The variability of the estimates is actually somewhat smaller than for GAMLSS, due to the regularization inherent in the boosting approach. This is also illustrated in Figure 4.5 by the pairwise comparison of the estimated coefficients for both methods, where SAASL leads to similar but slightly closer-to-zero estimates compared to the penalized maximum likelihood based method GAMLSS.

Figure 4.5: Pairwise comparison of the estimated coefficients between GAMLSS and SAASL.

Interestingly, Figure 4.4 also reveals that the inability to identify the informative variables results in the lowest MSE for all three individual coefficients for $\mu$ when using FSL (for more numerical details, see Appendix C).
Note that the y-axis is displayed with a logarithmic scale and each tick on the y-axisrepresents a tenfold increase over the previous one. The few points (FSL), whose m stop lie between and , show a better balance between the updates of µ and σ than other points, i.e., the middle regionof p m µ . But we also observe a bimodal distribution for FSL, i.e., lots of points are equal or close to p m µ = 0 and 1, with very low and extremely large values for m stop resulting, respectively.For SAASL, the distribution of p m µ in Figure 4.6(a) is unimodal. The mode smaller than 0.5 indicatesSAASL updates σ a little more frequently than µ . Unlike the cyclical approach that enforces an equalnumber of updates for all distribution parameters, the balance formed by SAASL is more natural. Thisbalance enables the alternate updates between two predictors even though they lie on different scales.Therefore, the information in µ can be fairly discovered in time and it reduces the risk of overlooking theinformative base-learners with respect to µ . The number of simulations runs, in which µ is not updatedat all ( p m µ = 0 ), reduces from 39 in FSL to only 5 in SAASL. Moreover, none of the 100 simulationsrequire a substantial amount of updates for µ to get well estimated coefficients (cf. also Figure 4.4). p m m C oun t FSLSAASL (a) Histogram p m m m s t op FSLSAASL (b) Scatter plot
Figure 4.6: Distribution of $p_{m_\mu}$ in $B = 100$ simulation runs. (a) Histogram of $p_{m_\mu}$; the histograms of the two approaches are overlayed using transparency. (b) Scatter plot of $m_{\text{stop}}$ against $p_{m_\mu}$; points and crosses are displayed with transparency, and the y-axis is displayed on a logarithmic scale with base 10, each tick representing a tenfold increase over the previous one.

Table 4.2 displays the information about false positives and false negatives of the two approaches in all 100 simulations with respect to $\mu$ and $\sigma$. For example, the second and fourth numbers, 77 and 21, in the first line indicate that the informative variable $x_{\cdot 2}$ is not included in the final model in 77 out of 100 simulation runs (i.e. false negatives), while there are 21 simulations whose final model contains the non-informative variable $x_{\cdot 4}$ (i.e. false positives). Similar to Figure 4.2 in Section 4.1, the conservative small step-length for $\mu$ in FSL increases the number of boosting iterations but reduces the risk of overfitting; that fewer simulations contain the noise variables for $\mu$ in FSL than in SAASL confirms this behavior. According to equation (11), the ASLs $\nu_{j^*,\sigma}$ are a sequence of values around 0.05 and (except for the values at early boosting iterations) mostly smaller than 0.1. Correspondingly, slightly more simulations overfit the $\sigma$-submodel in FSL than in SAASL.

Table 4.2: The number of simulations with false positives and false negatives for each variable under the different modelling methods, with respect to the two model parameters. The false negatives part shows the number of simulations in which the informative variables are excluded from the final model, and the false positives part shows how many simulations include the non-informative variables in their final model. Values are taken at the stopping iteration determined by 10-fold CV.

                    False Negatives                 False Positives
                 $x_{\cdot 1}$  $x_{\cdot 2}$  $x_{\cdot 3}$  $x_{\cdot 4}$  $x_{\cdot 5}$
$\mu$     FSL        83            77            81            21            20
          SAASL      28            24            28            72            73
$\sigma$  FSL         9             1             6            83            82
          SAASL      18             1             9            70            67

Although the non-informative variables for $\mu$ are excluded from the FSL model, the informative ones are excluded as well; in fact, $\mu$ was not updated at all in many simulations (cf. Figure 4.6(a)). The false negatives part of Table 4.2 for $\mu$ confirms this: the informative variables $x_{\cdot 1}$ to $x_{\cdot 3}$ are excluded from the final model in the majority of simulations with FSL but not with SAASL. For $\sigma$, the smaller step-length $\nu_{j^*,\sigma}$ in SAASL selects variables more conservatively, and as a consequence slightly more simulations underfit the $\sigma$-submodel in SAASL than in FSL, but the difference is far less pronounced.

5 Applications

We apply the proposed algorithms to two datasets. The malnutrition dataset demonstrates the shortcomings of FSL and the pitfalls of the numerical determination of the ASL with a fixed search interval, and with the riboflavin dataset we illustrate the variable selection properties of each algorithm.

5.1 Malnutrition of children in India
The first dataset, called india and available in the R package gamboostLSS [Hofner et al. (2018); Fahrmeir and Kneib (2011)], is sampled from the Standard Demographic and Health Survey conducted between 1998 and 1999 on malnutrition of children in India [Fahrmeir and Kneib (2011)]. The sample contains 4000 observations and four variables (BMI of the child (cBMI), age of the child in months (cAge), BMI of the mother (mBMI) and age of the mother in years (mAge)). The outcome of interest is a numeric z-score for malnutrition ranging from -6 to 6, where negative values represent malnourished children. To highlight the problems of using a fixed step-length, we work with the original variable stunting (corresponding to 100 * z-score). The identity and logarithm functions are used as the link functions for $\mu$ and $\sigma$, respectively. Because this is not a high-dimensional data example, we use GAMLSS with penalized maximum likelihood estimation as a gold standard to examine the effectiveness of the adaptive approaches.

Table 5.1 lists the estimated coefficients of each variable in the predictors $\eta_\mu$ and $\eta_\sigma$ at the stopping iteration tuned by 10-fold CV, where the maximum number of iterations is set to 2000. The estimated intercept in $\eta_\sigma$ indicates a large variance of the response, so the setting is similar to the second simulation above. It is therefore not surprising that FSL selects only one variable (cAge) for $\eta_\mu$: a large number of updates for the base-learner would be required, but the given maximal number of boosting iterations is not large enough. In practice we could certainly increase the maximum number of iterations as well as enlarge the commonly applied step-length 0.1 in order to estimate the coefficients well, but these choices are very subjective and likely result in tedious manual fine-tuning based on trial and error.

The ASL method with the default predefined search interval $[0, 10]$ encounters a similar problem as FSL. Apart from the only selected, and underfitted, variable cAge for $\mu$, the two variables (cBMI and cAge) for the $\sigma$-submodel are also underfitted compared with the results of the gold standard GAMLSS. The reason for this phenomenon lies in the relationship between the variance and the step-length discussed in equation (5): the log-link, or exponential transformation, for $\eta_\sigma$ in this example requires a sequence of huge step-lengths, but the default search interval does not fulfill this requirement. Estimating the ASL with an increased search interval of $[0, 10^5]$, denoted ASL5 in Table 5.1, results in coefficients comparable to those of GAMLSS. But choosing a suitable search interval becomes an unavoidable side task for ASL when analyzing this kind of dataset.

The results of the two semi-analytical approaches hardly differ from the maximum likelihood based GAMLSS. Unlike the numerical determination with a fixed search interval in ASL, the analytical approaches replace this procedure with a direct and precise solution that eliminates the potential manual intervention (e.g. increasing the search interval).
Contrary to the direct influence of the variance on $\nu^{*[m]}_{j^*,\mu}$ in equation (5), the optimal step-length $\nu^{*[m]}_{j^*,\sigma}$ is dominated by the chosen base-learner; but as the number of learning iterations increases, such effects gradually disappear, and $\nu^{*[m]}_{j^*,\sigma}$ finally converges to 0.5. Thus, our default search interval $[0, 10]$ is sufficient for $\nu^{*[m]}_{j^*,\sigma}$, and increasing the range of the search interval as for $\nu^{*[m]}_{j^*,\mu}$ in ASL is almost never necessary. Theoretically, the ASL with a sufficiently large search interval (ASL5 in this example) and SAASL should result in the same values, as discussed in the previous theoretical section. Due to the finite calculation accuracy of computers and the numerical optimization steps, their outputs are very similar but can differ slightly for the malnutrition data.

Table 5.1: Comparison of the estimated coefficients.

                            FSL         ASL         ASL5       SAASL      SAASL05    GAMLSS
(Intercept)  $\eta_\mu$     -174.77179  -169.20269  -91.16032  -91.16011  -91.16041  -91.15953
cBMI         $\eta_\sigma$    -0.00319    -0.00268   -0.01527   -0.01526   -0.01527   -0.01526
cAge         $\eta_\mu$       -0.03845    -0.37092   -5.84669   -5.84665   -5.84670   -5.84663
cAge         $\eta_\sigma$    -0.00103    -0.00074    0.00284    0.00284    0.00284    0.00284

Figure 5.1 shows the optimal step-lengths $\nu^{*[m]}_{j^*,\mu}$ and $\nu^{*[m]}_{j^*,\sigma}$ when using SAASL, for each variable, up to the 769 boosting iterations specified by 10-fold CV. Apparently, the optimal step-lengths for $\mu$ over the entire learning process are over 20000, which is far larger than the fixed step-length 0.1 and the upper boundary 10 of the predefined search interval in ASL. Without knowing this information, it is not trivial to determine the search interval for $\nu^{*[m]}_{j^*,\mu}$; we thus (after acquiring this graphic) re-estimated the example data with ASL5. Additionally, Figure 5.1(b) illustrates the optimal step-lengths for $\sigma$. After several boosting iterations, the optimal values for each covariate converge to their own stable regions (ranging from about 0.38 to 0.56). As discussed above, the optimal step-lengths for $\sigma$ should be values around 0.5, and this graphic confirms this statement.

Figure 5.1: The optimal step-lengths of both model parameters against the boosting iterations: (a) $\nu^{*[m]}_{j^*,\mu}$, (b) $\nu^{*[m]}_{j^*,\sigma}$. Up to the stopping iteration specified by 10-fold CV (here $m_{\text{stop}} = 769$), 406 iterations are used to update $\mu$ and 363 iterations are used to update $\sigma$.

As this example is not high-dimensional and does not necessarily require variable selection, we can use GAMLSS with penalized maximum likelihood estimation for comparison. The fact that its results are very similar to those of the semi-analytical approaches indicates that the results of SAASL and SAASL05 are reliable. The only alternative to achieve balance between the predictors would be a cyclical algorithm (with the downsides discussed in the introduction). Rescaling the response variable or standardizing the negative partial derivatives could reduce the scaling problem to some extent, but would not eliminate the need to increase the step-length or reduce the imbalance between the predictors.

5.2 Riboflavin dataset

This dataset describes the riboflavin (also known as vitamin B2) production of Bacillus subtilis, containing 71 observations and 4088 predictors (gene expressions) [Bühlmann et al. (2014); Dezeure et al. (2015)]. The log-transformed riboflavin production rate, which is close to a Gaussian distribution, is regarded as the response.
This dataset is chosen to demonstrate the capability of the boosting algorithm to deal with situations in which the number of covariates exceeds the number of observations. Please note that a comparison with the original GAMLSS algorithm is not possible in this case, since that algorithm cannot deal with more model parameters than available observations. In order to compare the out-of-sample MSE of the algorithms, we randomly select 10 observations as the validation set.

Table 5.2 summarizes the selected informative variables for µ and σ separately at the stopping iteration tuned by 5-fold CV; the corresponding coefficients are listed in Appendix D. The results demonstrate the overlap of the selected variables: for example, FSL selects 13 informative variables in total for µ, 9 of which are also chosen by ASL and SAASL, and 11 of which are common with SAASL05. In general, for both µ and σ, more variables are included by the adaptive approaches, and the difference in the selected variables mainly lies between the adaptive and the fixed approach. Because the optimal step-length $\nu^{*[m]}_{j^*,\mu}$ lies in the predefined search interval [0, 10] (and is actually smaller than 1, i.e. the adaptive step-length $\nu^{[m]}_{j^*,\mu} < 0.1$), and $\nu^{*[m]}_{j^*,\sigma}$ also lies in its narrower predefined search interval, ASL and SAASL yield the same results. Moreover, as the adaptive step-length is smaller than the fixed step-length 0.1, the adaptive approaches make conservative (small) updates, leading to more boosting iterations. Several of the gene expressions for µ and σ are selected by all algorithms and are thus consistently included in the set of informative covariates; in fact, almost all gene expressions chosen by FSL are also recognized as informative by the other methods.

To compare the performance of the algorithms, Table 5.3 lists the out-of-sample MSE. In contrast to the fixed approach, the three adaptive approaches perform well in general, with SAASL05 performing slightly worse than the other two. In addition, Table 5.3 reports the result of the lasso estimator from the R package glmnet [Friedman et al. (2010)], as suggested by Bühlmann et al. (2014). The mean squared prediction error of glmnet is the smallest among the five approaches, but the difference to the adaptive approaches is relatively small. As glmnet cannot model the scale parameter σ, only the estimated coefficients of the µ-submodel are provided in Appendix D.
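A glmnet benchmark of this kind can be set up along the following lines. This is a minimal sketch, not the exact code behind Table 5.3; it assumes the riboflavin data as shipped with the hdi package of Dezeure et al. (2015), and the seed and hence the split are arbitrary:

```r
library(glmnet)
data(riboflavin, package = "hdi")   # y: log riboflavin production, x: 4088 gene expressions

set.seed(1)
test <- sample(length(riboflavin$y), 10)          # random validation set of 10 observations

# Lasso with lambda chosen by internal cross-validation on the training part
cvfit <- cv.glmnet(riboflavin$x[-test, ], riboflavin$y[-test])
pred  <- predict(cvfit, newx = riboflavin$x[test, ], s = "lambda.min")
mean((riboflavin$y[test] - pred)^2)               # out-of-sample MSE
```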
Table 5.2: Number of chosen variables for η_µ and η_σ. The diagonal depicts the number of variables per method, the off-diagonal elements the number of overlapping variables.

                        η_µ                               η_σ
           FSL   ASL   SAASL   SAASL05      FSL   ASL   SAASL   SAASL05
FSL         13     9       9        11       16     9       9        12
ASL          9    20      20        18        9    17      17        15
SAASL        9    20      20        18        9    17      17        15
SAASL05     11    18      18        24       12    15      15        24

Table 5.3: Out-of-sample MSE.

        FSL    ASL   SAASL   SAASL05   glmnet
MSE   2.611  1.111   1.111     1.193    0.946
Out of the 21 genes selected by glmnet, 7 and 9 are also selected by ASL/SAASL and SAASL05, respectively. The signs (positive/negative) of the estimated coefficients of these common covariates from glmnet match those of the adaptive approaches. This comparison indicates that boosted GAMLSS with adaptive step-length is an applicable and competitive approach for high-dimensional data analysis.
6 Conclusions and outlook

The step-length is often not treated as an important tuning parameter in boosting algorithms, as long as it is set to a small value. However, if complex models like GAMLSS, with several predictors for the different distribution parameters, are estimated, the different scales of the distribution parameters can lead to imbalanced updates and consequently to poor performance if one common small fixed step-length is used, as we show in this paper.

The main contribution of this article is the proposal to use separate adaptive step-lengths for each distribution parameter in a non-cyclical boosting algorithm for GAMLSS, optimized in each iteration. In addition to the resulting balance of updates between the different distribution parameters, a balance between over- and underfitting is obtained by taking only a proportion (shrinkage parameter), such as 10%, of the determined optimal step-length as the adaptive step-length. The optimal step-length can be found by optimization procedures such as a line search; we illustrated with an example the importance of enlarging the search interval where necessary to find the optimal solution.

For the important special case of the Gaussian distribution, we derived an analytical solution for the adaptive step-length of the mean parameter µ, which avoids numerical optimization and the specification of a search interval. For the scale parameter σ, we obtained an approximate solution of 0.5 (or 0.05 with a 10% proportion), which provides a better motivated default value than 0.1 relative to the step-length for µ, and discussed a combination with a one-dimensional line search in the semi-analytical approach.

In simulations and empirical applications, we showed favorable behavior compared to using a fixed step-length (FSL). Our adaptive approaches were highly competitive with standard GAMLSS in terms of estimation accuracy in the low-dimensional case, while the adaptive boosting approach additionally offers shrinkage and variable selection, which also makes it applicable to the high-dimensional case of more covariates than observations. Overall, the semi-analytical method for adaptive step-length selection performed best among the considered methods.

In this paper we focused on the special case of the Gaussian distribution to derive analytical or semi-analytical solutions for the optimal step-length. In other cases, a line search has to be conducted for all distribution parameters. In the future, it is worth deriving analytical adaptive step-lengths for other distributions as well, because analytical or approximate adaptive step-lengths increase numerical efficiency and also reveal the relationships between the optimal step-lengths of the different distribution parameters and the model parameters (as well as properties of commonly used but probably less than ideal step-length settings).

Further work should also include the implementation of further (e.g. non-linear or spatial) effects [Hothorn et al. (2011)] in the model and testing the influence of the adaptive step-length on such effects. Moreover, through our application of the algorithm we discovered correlations between the optimal step-length $\nu^{*[m]}_{j^*,\mu}$ of a variable and the coefficient of this variable in the σ-submodel.
Future work should therefore also investigate the relationship among the optimal step-lengths of the different parameters and the relationship of these step-lengths to the model coefficients.

A basic R package ASL accompanying this article is available online at https://github.com/FAUBZhang/ASL. It contains the source code of Algorithm 3 and the corresponding cross-validation function, as well as some simple examples. The package originated from the R package gamboostLSS, and we hope to integrate the ASL functionality into the latter in the future.
Acknowledgements
The work on this article was supported by the Freigeist Fellowships of the Volkswagen Stiftung, project "Bayesian Boosting - A new approach to data science, unifying two statistical philosophies". Boyao Zhang performed the present work in partial fulfilment of the requirements for obtaining the degree "Dr. rer. biol. hum." at the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU).
References
Brent, R. P. (2013). Algorithms for Minimization Without Derivatives. Courier Corporation.

Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics 34(2), 559–583.

Bühlmann, P. and T. Hothorn (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science 22(4), 477–505.

Bühlmann, P., M. Kalisch, and L. Meier (2014). High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application 1(1), 255–278.

Bühlmann, P. and B. Yu (2003). Boosting with the L2 loss. Journal of the American Statistical Association 98(462), 324–339.

Dezeure, R., P. Bühlmann, L. Meier, and N. Meinshausen (2015). High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science 30(4), 533–558.

Fahrmeir, L. and T. Kneib (2011). Bayesian Smoothing and Regression for Longitudinal, Spatial and Event History Data. Oxford University Press.

Friedman, J., T. Hastie, and R. Tibshirani (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28(2), 337–407.

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5), 1189–1232.

Hepp, T., M. Schmid, O. Gefeller, E. Waldmann, and A. Mayr (2016). Approaches to regularized regression - a comparison between gradient boosting and the lasso. Methods of Information in Medicine 55(5), 422–430.

Hepp, T., M. Schmid, and A. Mayr (2019). Significance tests for boosted location and scale models with linear base-learners. The International Journal of Biostatistics 15(1), 20180110.

Hofner, B., L. Boccuto, and M. Göker (2015). Controlling false discoveries in high-dimensional situations: boosting with stability selection. BMC Bioinformatics 16(1), 144.

Hofner, B., A. Mayr, N. Fenske, and M. Schmid (2018). gamboostLSS: Boosting Methods for GAMLSS Models. R package version 2.0-1.

Hofner, B., A. Mayr, N. Robinzonov, and M. Schmid (2014). Model-based boosting in R: A hands-on tutorial using the R package mboost. Computational Statistics 29(1-2), 3–35.

Hofner, B., A. Mayr, and M. Schmid (2016). gamboostLSS: An R package for model building and variable selection in the GAMLSS framework. Journal of Statistical Software 74(1), 1–31.

Hothorn, T., P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner (2010). Model-based boosting 2.0. Journal of Machine Learning Research 11, 2109–2113.

Hothorn, T., P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner (2018). mboost: Model-Based Boosting. R package version 2.9-1.

Hothorn, T., J. Müller, B. Schröder, T. Kneib, and R. Brandl (2011). Decomposing environmental, spatial, and spatiotemporal components of species distributions. Ecological Monographs 81(2), 329–347.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14(2), 1137–1145.

Mayr, A., N. Fenske, B. Hofner, T. Kneib, and M. Schmid (2012). Generalized additive models for location, scale and shape for high dimensional data - a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics) 61(3), 403–427.

Meinshausen, N. and P. Bühlmann (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(4), 417–473.

Rigby, R. A. and D. M. Stasinopoulos (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54(3), 507–554.

Ripley, B. D. (2004). Selecting amongst large classes of models. In Methods and Models in Statistics, 155–170.

Stasinopoulos, D. and R. Rigby (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software 23(7), 1–46.

Stasinopoulos, D., R. Rigby, G. Heller, V. Voudouris, and F. De Bastiani (2017). Flexible Regression and Smoothing: Using GAMLSS in R. Chapman and Hall/CRC.

Thomas, J., A. Mayr, B. Bischl, M. Schmid, A. Smith, and B. Hofner (2018). Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates. Statistics and Computing 28(3), 673–687.
Appendices
A Derivation of the analytical ASL for the Gaussian distribution
Taking the negative log-likelihood as the loss function, the loss for the Gaussian distribution can be written as
\begin{align*}
\rho\left(\mathbf{y}, \{\eta_\mu, \eta_\sigma\}\right) &= -\log\Bigg[\left(\sqrt{2\pi}\right)^{-n} \det\Big(\operatorname{diag}\big(\exp(-\eta_\sigma(\mathbf{X}))\big)\Big) \\
&\qquad\qquad \cdot \exp\left(-\frac{1}{2}\left(\mathbf{y}-\eta_\mu(\mathbf{X})\right)^T \operatorname{diag}\big(\exp(-2\eta_\sigma(\mathbf{X}))\big)\left(\mathbf{y}-\eta_\mu(\mathbf{X})\right)\right)\Bigg] \\
&= \frac{n}{2}\log(2\pi) + \mathbf{1}_n^T\,\eta_\sigma(\mathbf{X}) + \frac{1}{2}\left(\mathbf{y}-\eta_\mu(\mathbf{X})\right)^T \operatorname{diag}\big(\exp(-2\eta_\sigma(\mathbf{X}))\big)\left(\mathbf{y}-\eta_\mu(\mathbf{X})\right).
\end{align*}
The negative partial derivatives for both distribution parameters in iteration $m$ are then
\begin{align}
\mathbf{u}^{[m]}_\mu &= -\frac{\partial \rho\left(\mathbf{y}, \{\hat\eta^{[m-1]}_\mu, \hat\eta^{[m-1]}_\sigma\}\right)}{\partial \hat\eta_\mu} \tag{12}\\
&= \operatorname{diag}\Big(\exp\big(-2\hat\eta^{[m-1]}_\sigma(\mathbf{X})\big)\Big)\left(\mathbf{y}-\hat\eta^{[m-1]}_\mu(\mathbf{X})\right), \tag{13}\\
\mathbf{u}^{[m]}_\sigma &= -\frac{\partial \rho\left(\mathbf{y}, \{\hat\eta^{[m-1]}_\mu, \hat\eta^{[m-1]}_\sigma\}\right)}{\partial \hat\eta_\sigma} \tag{14}\\
&= -\mathbf{1}_n + \operatorname{diag}\left(\mathbf{y}-\hat\eta^{[m-1]}_\mu(\mathbf{X})\right) \operatorname{diag}\Big(\exp\big(-2\hat\eta^{[m-1]}_\sigma(\mathbf{X})\big)\Big)\left(\mathbf{y}-\hat\eta^{[m-1]}_\mu(\mathbf{X})\right). \tag{15}
\end{align}
Both $\mathbf{u}^{[m]}_\theta$, $\theta \in \{\mu, \sigma\}$, can be regressed on the simple linear base-learner $h^{[m]}_{j^*,\theta}(\mathbf{x}_{\cdot j^*})$, where $j^*$ denotes the best-fitting variable:
\begin{align}
\mathbf{u}^{[m]}_\mu &= \hat h^{[m]}_{j^*,\mu}(\mathbf{x}_{\cdot j^*}) + \hat{\boldsymbol\epsilon}^{[m]}_\mu, \tag{16}\\
\mathbf{u}^{[m]}_\sigma &= \hat h^{[m]}_{j^*,\sigma}(\mathbf{x}_{\cdot j^*}) + \hat{\boldsymbol\epsilon}^{[m]}_\sigma, \tag{17}
\end{align}
where $\hat{\boldsymbol\epsilon}^{[m]}_\mu$ and $\hat{\boldsymbol\epsilon}^{[m]}_\sigma$ denote the residuals of the simple linear regression models.

A.1 Optimal step-length for µ

The analytical optimal step-length for $\mu$ in iteration $m$ is obtained by minimizing the empirical risk:
\begin{align*}
\nu^{*[m]}_{j^*,\mu} &= \arg\min_\nu \sum_{i=1}^n \rho\left(y_i, \{\hat\eta^{[m]}_\mu(\mathbf{x}_{i\cdot}), \hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot})\}\right) \\
&= \arg\min_\nu \sum_{i=1}^n \rho\left(y_i, \{\hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot}) + \nu\,\hat h^{[m]}_{j^*,\mu}(x_{ij^*}), \hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot})\}\right) \\
&= \arg\min_\nu \sum_{i=1}^n \left[\frac{1}{2}\log(2\pi) + \log\left(\hat\sigma^{[m-1]}_i\right) + \frac{\left(y_i - \hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot}) - \nu\,\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2}{2\,\hat\sigma^{2[m-1]}_i}\right] \\
&= \arg\min_\nu \sum_{i=1}^n \frac{\left(y_i - \hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot}) - \nu\,\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2}{2\,\hat\sigma^{2[m-1]}_i}.
\end{align*}
Note that $\hat\sigma^{2[m-1]}_i$ represents the square of the standard deviation in the previous boosting iteration, i.e. $\hat\sigma^{2[m-1]}_i = \big(\hat\sigma^{[m-1]}_i\big)^2$, and according to the model specification $\hat\sigma^{[m-1]}_i = \exp\big(\hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot})\big)$. It can be shown that the expression is a convex function of $\nu$, so the optimal value $\nu^{*[m]}_{j^*,\mu}$ is obtained by setting the first derivative to zero:
\begin{align*}
\frac{\partial}{\partial\nu} \sum_{i=1}^n \frac{\left(y_i - \hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot}) - \nu\,\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2}{2\,\hat\sigma^{2[m-1]}_i}
&\overset{\text{Eq.\,(13)}}{=} \frac{\partial}{\partial\nu} \sum_{i=1}^n \frac{\left(u^{[m]}_{\mu,i}\,\hat\sigma^{2[m-1]}_i - \nu\,\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\right)^2}{2\,\hat\sigma^{2[m-1]}_i} \\
&= \sum_{i=1}^n \left[-\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\,u^{[m]}_{\mu,i} + \nu\,\frac{\big(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\big)^2}{\hat\sigma^{2[m-1]}_i}\right] \overset{!}{=} 0 \\
\Leftrightarrow\quad \nu &= \frac{\sum_{i=1}^n \hat h^{[m]}_{j^*,\mu}(x_{ij^*})\,u^{[m]}_{\mu,i}}{\sum_{i=1}^n \big(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\big)^2 / \hat\sigma^{2[m-1]}_i} \\
&\overset{\text{Eq.\,(16)}}{=} \frac{\sum_{i=1}^n \hat h^{[m]}_{j^*,\mu}(x_{ij^*})\left(\hat h^{[m]}_{j^*,\mu}(x_{ij^*}) + \hat\epsilon^{[m]}_{\mu,i}\right)}{\sum_{i=1}^n \big(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\big)^2 / \hat\sigma^{2[m-1]}_i}
= \frac{\sum_{i=1}^n \big(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\big)^2}{\sum_{i=1}^n \big(\hat h^{[m]}_{j^*,\mu}(x_{ij^*})\big)^2 / \hat\sigma^{2[m-1]}_i},
\end{align*}
where $\sum_{i=1}^n \hat h^{[m]}_{j^*,\mu}(x_{ij^*})\,\hat\epsilon^{[m]}_{\mu,i} = 0$, because the residuals are uncorrelated with the fitted values.
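As a sanity check (not part of the original derivation), the closed form can be verified numerically against a line search. All quantities below are simulated stand-ins for the boosting quantities of iteration $m$:

```r
# Verify nu* = sum(h^2) / sum(h^2 / sigma^2) on simulated stand-ins.
set.seed(42)
n      <- 100
x      <- rnorm(n)
sigma2 <- exp(2 * (0.2 + 0.3 * x))   # sigma_i^2 = exp(2 * eta_sigma(x_i)), heteroscedastic
u      <- rnorm(n)                   # stand-in for the gradient vector u_mu
h      <- fitted(lm(u ~ x))          # simple linear base-learner fitted to the gradient

nu_closed <- sum(h^2) / sum(h^2 / sigma2)

# Empirical risk as a function of nu, using y - eta_mu = u * sigma^2 (Eq. (13)):
risk   <- function(nu) sum((u * sigma2 - nu * h)^2 / (2 * sigma2))
nu_num <- optimize(risk, interval = c(0, 10 * nu_closed))$minimum

all.equal(nu_closed, nu_num, tolerance = 1e-3)   # TRUE up to optimize()'s precision
```

Because h is the OLS fit of u, the residuals are orthogonal to the fitted values, which is exactly the property used in the last step of the derivation above.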
A.2 Optimal step-length for σ

The analytical optimal step-length for $\sigma$ in iteration $m$ is obtained by minimizing the empirical risk:
\begin{align*}
\nu^{*[m]}_{j^*,\sigma} &= \arg\min_\nu \sum_{i=1}^n \rho\left(y_i, \{\hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot}), \hat\eta^{[m]}_\sigma(\mathbf{x}_{i\cdot})\}\right) \\
&= \arg\min_\nu \sum_{i=1}^n \rho\left(y_i, \{\hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot}), \hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot}) + \nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\}\right) \\
&= \arg\min_\nu \sum_{i=1}^n \left(\hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot}) + \nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\right) \\
&\qquad\qquad + \sum_{i=1}^n \frac{\left(y_i - \hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot})\right)^2}{2}\, \exp\left(-2\hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot}) - 2\nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\right),
\end{align*}
where the constant term $\frac{n}{2}\log(2\pi)$ has been dropped. It can be shown that the second derivative of this expression is positive and the expression thus a convex function of $\nu$. Setting the first derivative to zero, we get
\begin{align*}
&\frac{\partial}{\partial\nu}\left[\sum_{i=1}^n \left(\hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot}) + \nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\right) + \sum_{i=1}^n \frac{\left(y_i - \hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot})\right)^2}{2}\, \exp\left(-2\hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot}) - 2\nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\right)\right] \\
&\quad= \sum_{i=1}^n \hat h^{[m]}_{j^*,\sigma}(x_{ij^*}) - \sum_{i=1}^n \left(y_i - \hat\eta^{[m-1]}_\mu(\mathbf{x}_{i\cdot})\right)^2 \hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\, \exp\left(-2\hat\eta^{[m-1]}_\sigma(\mathbf{x}_{i\cdot}) - 2\nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\right) \\
&\quad\overset{\text{Eq.\,(15)}}{=} \sum_{i=1}^n \hat h^{[m]}_{j^*,\sigma}(x_{ij^*}) - \sum_{i=1}^n \left(u^{[m]}_{\sigma,i} + 1\right) \hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\, \exp\left(-2\nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\right) \\
&\quad\overset{\text{Eq.\,(17)}}{=} \sum_{i=1}^n \hat h^{[m]}_{j^*,\sigma}(x_{ij^*}) - \sum_{i=1}^n \left(\hat h^{[m]}_{j^*,\sigma}(x_{ij^*}) + \hat\epsilon^{[m]}_{\sigma,i} + 1\right) \hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\, \exp\left(-2\nu\,\hat h^{[m]}_{j^*,\sigma}(x_{ij^*})\right) \overset{!}{=} 0.
\end{align*}
This equation has no closed-form solution for $\nu$; the semi-analytical approach therefore determines $\nu^{*[m]}_{j^*,\sigma}$ by a one-dimensional line search.
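The convergence of $\nu^{*[m]}_{j^*,\sigma}$ towards 0.5 observed in Section 5 can be made plausible from this first-order condition. The following is only a heuristic sketch, not part of the original derivation: in later iterations the base-learner fits $\hat h_i := \hat h^{[m]}_{j^*,\sigma}(x_{ij^*})$ become small, so the exponential can be linearized,
$$\exp\left(-2\nu\,\hat h_i\right) \approx 1 - 2\nu\,\hat h_i.$$
Substituting this into the condition and using $\sum_{i=1}^n \hat h_i\,\hat\epsilon^{[m]}_{\sigma,i} = 0$ gives
$$\sum_{i=1}^n \hat h_i - \sum_{i=1}^n \left(\hat h_i + \hat\epsilon^{[m]}_{\sigma,i} + 1\right)\hat h_i\left(1 - 2\nu\,\hat h_i\right) \approx -\sum_{i=1}^n \hat h_i^2 + 2\nu\sum_{i=1}^n \hat h_i^2 = 0 \quad\Rightarrow\quad \nu \approx \frac{1}{2},$$
where the third-order terms $\hat h_i^3$ and $\hat\epsilon^{[m]}_{\sigma,i}\,\hat h_i^2$ have been dropped.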
B Further graphics of section 4.1
[Figure B.1: Boxplots of the estimated coefficients of η_µ (variables x1 to x4) in 100 simulation runs. Values are taken at the stopping iterations determined by 10-fold cross-validation. The results are separated into fixed and adaptive approaches (FSL, ASL, SAASL, SAASL05) with respect to the different numbers of non-informative variables, i.e. p_n-inf = 0, 50, 250 and 500. The horizontal red lines indicate the true coefficients. The shrinkage of the coefficients towards zero can be observed from this graphic.]
[Figure B.2: Boxplots of the estimated coefficients of η_σ (variables x3 to x6) in 100 simulation runs. Values are taken at the stopping iterations tuned by 10-fold cross-validation. The results are separated into fixed and adaptive approaches (FSL, ASL, SAASL, SAASL05) with respect to the different numbers of non-informative variables, i.e. p_n-inf = 0, 50, 250 and 500. The horizontal red lines indicate the true coefficients. The shrinkage of the coefficients towards zero can be observed from this graphic.]

[Figure B.3: Summary of the negative log-likelihood of 100 simulation runs for the different estimation approaches (FSL, ASL, SAASL, SAASL05) with respect to the different numbers of non-informative variables. Values are taken at the stopping iteration determined by 10-fold cross-validation.]

[Figure B.4: m_stop tuned by 10-fold CV for the different estimation methods with respect to the different numbers of non-informative variables. The predefined maximal number of learning iterations is 1000.]
C Further table of section 4.2
Table C.1: The average MSE of the estimated coefficients for both model parameters µ and σ for the three estimation approaches. The MSE of each coefficient is calculated not only over all 100 simulation runs (Total) at their stopping iterations, but also over the true-positive subsets (TP), i.e. the simulation runs in which the coefficient is selected by all three approaches.

                          µ                                        σ
           β̂            β̂            β̂            β̂           β̂           β̂
        Total    TP   Total    TP   Total    TP   Total   TP  Total   TP  Total   TP
FSL      13.7  80.9    26.5 112.2    13.8  76.0    0.84 0.81   1.41 1.42   0.84 0.82
GAMLSS  113.8 328.8   145.6 355.8   116.8 339.4    0.82 0.79   1.44 1.44   0.81 0.79
SAASL    71.3 250.2    95.8 271.7    73.3 238.5    0.85 0.81   1.39 1.40   0.85 0.82
D Estimated coefficients of riboflavin dataset
Table D.1: The estimated coefficients of the µ-submodel with the fixed and adaptive approaches and glmnet. Values are taken at the m_stop tuned by 5-fold CV. [The table lists the coefficients of the intercept and of 38 selected gene expressions (ARGF_at, IOLE_at, ..., YXLE_at) under FSL, ASL, SAASL, SAASL05 and glmnet.]
Table D.2: The estimated coefficients of the σ-submodel with the fixed and adaptive approaches. Values are taken at the m_stop tuned by 5-fold CV.