Fast marginal likelihood estimation of penalties for group-adaptive elastic net
Mirrelijn M. van Nee, Tim van de Brug, and Mark A. van de Wiel
Epidemiology and Data Science, Amsterdam University Medical Centers, The Netherlands
MRC Biostatistics Unit, Cambridge University, UK
Abstract
Nowadays, clinical research routinely uses omics data, such as gene expression, for predicting clinical outcomes or selecting markers. Additionally, so-called co-data are often available, providing complementary information on the covariates, like p-values from previously published studies or groups of genes corresponding to pathways. Elastic net penalisation is widely used for prediction and covariate selection. Group-adaptive elastic net penalisation learns from co-data to improve the prediction and covariate selection, by penalising important groups of covariates less than other groups. Existing methods are, however, computationally expensive. Here we present a fast method for marginal likelihood estimation of group-adaptive elastic net penalties for generalised linear models. We first derive a low-dimensional representation of the Taylor approximation of the marginal likelihood and its first derivative for group-adaptive ridge penalties, to efficiently estimate these penalties. Then we show by using asymptotic normality of the linear predictors that the marginal likelihood for elastic net models may be approximated well by the marginal likelihood for ridge models. The ridge group penalties are then transformed to elastic net group penalties by using the variance function. The method allows for overlapping groups and unpenalised variables. We demonstrate the method in a model-based simulation study and an application to cancer genomics. The method substantially decreases computation time and outperforms or matches other methods by learning from co-data.
1 Introduction

Clinical studies increasingly involve the analysis of omics data, like gene expression, metabolomics and radiomics. The research goal may be to predict some outcome, like therapy response or diagnosis, or to select markers for follow-up study. In addition to the main data with information on the samples, so-called co-data may be available, providing complementary information on the covariates. In cancer genomics, for example, genes are grouped according to pathways or gene ontology, and p-values may be derived from previous studies available in public repositories like The Cancer Genome Atlas (Tomczak et al., 2015). The predictions and covariate selection may improve by learning from co-data.

A popular approach to prediction and covariate selection is elastic net penalisation (Zou and Hastie, 2005), as it simultaneously estimates and selects covariates. Recent work includes co-data by allowing for differential group penalties, penalising informative groups of covariates less than non-informative ones. The method fwelnet (Tay et al., 2020) extends the use of grouped co-data to continuous co-data (termed features of features). The method may be used for grouped co-data as well, for which it is shown to correspond to an elastic net penalty on the group level, with the amount of penalisation governed by one global penalty parameter. Though fast, this method may not be flexible enough when groups differ largely in prediction strength. The method ipflasso (Boulesteix et al., 2017) selects the best group penalties from a proposed grid of possible values and is therefore able to flexibly adapt group penalties. In general, however, it is unclear what values should be proposed and only a finite set of values may be proposed. Moreover, the method is not scalable in the number of groups, as the number of possible grid configurations increases exponentially. The method gren (Münch et al., 2019), for group-adaptive elastic net, uses a variational Bayes algorithm to compute empirical Bayes estimates for the group penalties in logistic regression. The method is group-adaptive and scalable in the number of groups, but not in the number of covariates. Moreover, it is only implemented for binary response. Note that group-adaptive methods are very different in nature than (variations of) the group lasso (Meier et al., 2008). The latter selects entire groups. While this may be useful when there are many small groups, its limited flexibility in terms of penalisation may render inferior predictive performance for settings with a limited number of groups (Münch et al., 2019).

Here we propose a fast and efficient method for marginal likelihood estimation of group elastic net penalties. The method is group-adaptive and may be used for elastic net penalised generalised linear models in high-dimensional data. We include details for linear and logistic regression. Groups may be overlapping and the method allows for unpenalised covariates, such as an intercept or clinical covariates like age and sex. First, we derive a low-dimensional representation of the marginal likelihood for ridge models, a special case of elastic net. Then we show by using the asymptotic multivariate normality of the linear predictor that this marginal likelihood also approximates the marginal likelihood of elastic net models well.
Lastly, we show how the ridge group penalties are transformed to find optimal marginal likelihood elastic net group penalties.

The outline of the paper is as follows. Section 2 includes details of the method. Section 3 first compares performance and computation times of several methods in a linear regression model-based simulation study. After, the method is illustrated in the logistic regression setting on a cancer genomics data example. Section 4 then concludes and discusses the method and results.

2 Method

Let the response be given by $Y \in \mathbb{R}^n$ and the observed high-dimensional data, $p > n$, by $X \in \mathbb{R}^{n \times p}$. Let the co-data matrix $Z \in \mathbb{R}^{p \times G}$ contain group membership information for $G$ groups as defined in (van Nee et al., 2020). First assume that groups are not overlapping; below, we show how the method may be used for partly overlapping groups too. Let the observed data for group $g$ be given by $X_g$. We model the response with a generalised linear model, and impose a group-regularised elastic net prior on the regression coefficients $\beta \in \mathbb{R}^p$, with group penalties $\lambda \in \mathbb{R}^G$:

$$Y_i \mid X_i, \beta \overset{\text{ind.}}{\sim} \pi(Y_i \mid X_i, \beta), \quad \mu_i := E_{Y_i \mid X_i, \beta}(Y_i) = g^{-1}(X_i\beta), \quad i = 1, .., n,$$
$$\pi(\beta_k \mid \alpha, \lambda_{g_k}) \propto \exp\left(-\lambda_{g_k}\left(\alpha|\beta_k| + \tfrac{1-\alpha}{2}\beta_k^2\right)\right), \quad k = 1, .., p, \qquad (1)$$

with $g(\cdot)$ the link function corresponding to the type of generalised linear model used, $\operatorname{Var}(Y_i) = \phi V(\mu_i)$ for some known variance function $V(\cdot)$ and scale parameter $\phi$, and $\lambda_{g_k}$ the group-specific penalty of the group $g_k$ to which covariate $k$ belongs.

We use an empirical Bayes approach and estimate the group penalties for a given $\alpha$ to arrive at the group-adaptive regularised elastic net estimate for the regression coefficients:

$$\hat{\lambda} = \operatorname*{argmax}_{\lambda} \pi(Y \mid \lambda, \alpha) = \operatorname*{argmax}_{\lambda} \int_{\beta} \pi(Y \mid \beta)\,\pi(\beta \mid \lambda, \alpha)\,\mathrm{d}\beta, \qquad (2)$$
$$\hat{\beta} = \operatorname*{argmax}_{\beta} \left\{\log(\pi(Y \mid \beta)) + \log(\pi(\beta \mid \hat{\lambda}, \alpha))\right\}. \qquad (3)$$

The data $X$ possibly contain some unpenalised variables next to the penalised variables, e.g. for an intercept, which we will refer to by $X_{\text{unpen}} \in \mathbb{R}^{n \times p_0}$ and $X_{\text{pen}} \in \mathbb{R}^{n \times p_1}$ respectively, with $p = p_0 + p_1$. In that case, only the penalised variables are integrated out in Equation (2). We refer to the unpenalised and penalised regression coefficients by $\beta_{\text{unpen}}$ and $\beta_{\text{pen}}$ respectively.
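To fix ideas, the following R sketch shows the two-step procedure of Equations (2) and (3). Here `estimate_group_penalties` is a hypothetical placeholder for the marginal likelihood maximisation of Equation (2) (a concrete linear-model version is sketched further on); the estimate of Equation (3) is then obtained with glmnet, passing the group penalties as per-covariate penalty factors. Note that glmnet internally rescales penalty factors, so they act as relative penalties, with the global level set via the lambda argument.

library(glmnet)

group_adaptive_en <- function(X, y, groups, alpha, lambda_glob,
                              family = "gaussian") {
  # Step 1 (Equation (2)): estimate group penalties by marginal likelihood;
  # placeholder name, not an actual package function.
  lambda_hat <- estimate_group_penalties(X, y, groups, alpha)
  # Step 2 (Equation (3)): penalised estimate with group-wise penalty factors.
  glmnet(X, y, family = family, alpha = alpha,
         penalty.factor = lambda_hat[groups],  # one factor per covariate
         lambda = lambda_glob)                 # global level, e.g. recalibrated
}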
First, consider group-regularised ridge models ($\alpha = 0$). Wood (2011) derives a Laplace approximation of the marginal likelihood for semiparametric generalised linear models when the prior distribution is of the form:

$$\beta \sim N(\mathbf{0}, \phi S^{-}), \quad S = \sum_g \lambda_g S_g, \qquad (4)$$

with $S^{-}$ a generalised inverse and $S$ the ridge penalty matrix equal to the weighted sum over $G$ different known penalty matrices $S_g$. The R-package mgcv implementing this approximation is developed for low-dimensional data and does not allow for high-dimensional data. In theory, the results include the case of ridge models in high-dimensional data as well, i.e. define $S_g = I_g \in \mathbb{R}^{p \times p}$, with $I_g$ the diagonal matrix with the $k$th diagonal element equal to 1 if covariate $k$ belongs to group $g$ and 0 otherwise. In practice, however, the Laplace approximation may be inaccurate for high-dimensional integrals (van de Wiel et al., 2019) and is computationally expensive due to the large dimension $p$.

For ridge models, we may rewrite the high, $p$-dimensional integral as a lower, $n$-dimensional integral by observing that the likelihood only depends on $\beta$ via the linear predictors $\eta = X\beta$ (Veerman et al., 2019). The resulting prior distribution for $\eta_{\text{pen}} = X_{\text{pen}}\beta_{\text{pen}} \in \mathbb{R}^n$ is again a multivariate normal distribution:

$$\eta_{\text{pen}} \mid \lambda, \phi \sim N\left(\mathbf{0},\, X_{\text{pen}}\phi\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^T\right) = N\left(\mathbf{0},\, \phi\sum_g \lambda_g^{-1}X_gX_g^T\right), \qquad (5)$$

with $\Lambda_{\text{pen}} \in \mathbb{R}^{p_1 \times p_1}$ denoting the diagonal ridge penalty matrix for the penalised variables. Note that in general, the penalty matrix $(\sum_g \lambda_g^{-1}X_gX_g^T)^{-1}$ cannot be written as a weighted combination of $G$ penalty matrices as in Equation (4) when $G > 1$. Moreover, when unpenalised variables are included, the dimension of $(\beta_{\text{unpen}}^T, \eta_{\text{pen}}^T)^T$ is still larger than the number of samples, preventing straightforward use of the existing software even when only one group of high-dimensional data is included.

Here, we derive a low-dimensional representation of the high-dimensional Laplace approximated marginal likelihood and its first derivative as derived by Wood (2011), which may be computed efficiently for multiple groups and allows for inclusion of unpenalised variables. Recently, it was shown that the maximum likelihood estimate for the linear predictors, $\hat{\eta}$, may be obtained efficiently by rewriting the steps of the iterative weighted least squares (IWLS) algorithm in $n$-dimensional terms (van de Wiel et al., 2020). We use similar results to rewrite the Laplace approximation and its first derivative in low-dimensional terms. We then use a general purpose optimiser to optimise the marginal likelihood for group-regularised ridge models. Below we state the results; details are given in Appendix A.

The Laplace approximation of the minus log marginal likelihood for group-regularised ridge models is:

$$-\ell(\lambda, \phi) \approx -\ell(\hat{\eta}, \phi) + \frac{1}{2\phi}(y - \mu)^T\hat{\eta} - \frac{1}{2}\log(|I_{n \times n} - WH_{\text{pen}}|), \qquad (6)$$

with $\ell(c)$ denoting the log likelihood given parameters $c$, $I_{n \times n} \in \mathbb{R}^{n \times n}$ the identity matrix, $W \in \mathbb{R}^{n \times n}$ the weight matrix in IWLS, and $H_{\text{pen}}$ the hat matrix for the penalised variables only:

$$H_{\text{pen}} = X_{\text{pen}}(X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}X_{\text{pen}}^T, \qquad (7)$$

with $\Lambda_{\text{pen}}$ the diagonal penalty matrix containing the group penalties. We use the efficient expression of the $n \times n$-dimensional hat matrix as derived in (van de Wiel et al., 2020).

We state the derivative in terms of $\rho_g = \log(\lambda_g)$. First we define some useful expressions. Let $I_g \in \mathbb{R}^{p \times p}$ denote the diagonal matrix with diagonal element $k$ equal to 1 if covariate $k$ belongs to group $g$ and 0 otherwise. Denote the full hat matrix by $H$, i.e. as in Equation (7) but with $X_{\text{pen}}$ substituted by $X$, and denote the contribution of the $g$th group to the hat matrix by $H_g := X(X^TWX + \Lambda)^{-1}I_gX^T$, with the efficient lower-dimensional representation given in Appendix A. The partial derivatives of the Laplace approximation of the log marginal likelihood to the group parameters are given by, for $g = 1, .., G$:

$$\begin{aligned}
\frac{\partial(-\ell(\rho, \phi))}{\partial\rho_g} =\ & \frac{1}{\phi}(y - \mu)^TV^{-1}G'^{-1}\left[I_{n \times n} - HW\right]H_g^T(y - \mu + W\hat{\eta}) \\
& - \frac{1}{2\phi}\left(-G'^{-1}\hat{\eta} + y - \mu\right)^T\left[I_{n \times n} - HW\right]H_g^T(y - \mu + W\hat{\eta}) \\
& + \frac{1}{2}\operatorname{tr}\left(H_{\text{pen}}\frac{\partial W}{\partial\rho_g}\right) - \frac{1}{2}\operatorname{tr}\left(\exp(-\rho_g)X_gX_g^T\left(W^{-1} + \sum_{g'=1}^G \lambda_{g'}^{-1}X_{g'}X_{g'}^T\right)^{-1}\right),
\end{aligned} \qquad (8)$$

where $V \in \mathbb{R}^{n \times n}$ is the diagonal matrix with diagonal elements $V(\mu_i)$, $G'$ is the diagonal matrix with diagonal elements $g'(\mu_i)$, and the partial derivative $\partial W/\partial\rho_g$ is readily obtained from (Wood, 2011). The partial derivatives to $\lambda_g$ are obtained by dividing the derivative to $\rho_g$ by $\lambda_g$.

We optimise $\phi$ jointly with the group parameters. As $\hat{\eta}$ does not depend on $\phi$, the partial derivative is given by:

$$\frac{\partial(-\ell(\rho, \phi))}{\partial\phi} = \frac{\partial(-\ell(\hat{\eta}, \phi))}{\partial\phi} - \frac{1}{2\phi^2}(y - \mu)^T\hat{\eta}. \qquad (9)$$

The partial derivative of the log likelihood depends on the type of GLM considered and is known for canonical models. We include details for linear regression in Appendix A. For logistic regression, $\phi = 1$ is constant and the partial derivative is 0.
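For the linear model the Laplace approximation is exact (Appendix A), so the marginal likelihood can be written directly as an $n$-dimensional multivariate normal density. The following R sketch illustrates the resulting computation and its optimisation over $\rho = \log(\lambda)$ and $\log(\sigma^2)$ with a general purpose optimiser, mirroring the approach above; it assumes groups are coded $1, .., G$ and all function names are illustrative.

# Minus log marginal likelihood for the group-ridge linear model:
# Y | lambda, sigma2 ~ N(0, sigma2 * (I_n + sum_g X_g X_g^T / lambda_g)),
# computed via n x n matrices only.
neg_log_marg_lik <- function(par, X, y, groups) {
  G <- length(unique(groups)); n <- length(y)
  lambda <- exp(par[1:G]); sigma2 <- exp(par[G + 1])
  Sigma <- diag(n)
  for (g in seq_len(G)) {
    Xg <- X[, groups == g, drop = FALSE]
    Sigma <- Sigma + tcrossprod(Xg) / lambda[g]
  }
  R <- chol(sigma2 * Sigma)
  # -log N(y; 0, V) through the Cholesky factor of V
  0.5 * (n * log(2 * pi) + 2 * sum(log(diag(R))) +
         sum(backsolve(R, y, transpose = TRUE)^2))
}

# Usage sketch:
# opt <- optim(rep(0, G + 1), neg_log_marg_lik, X = X, y = y, groups = groups,
#              method = "Nelder-Mead")
# lambda_R_hat <- exp(opt$par[1:G]); phi_hat <- exp(opt$par[G + 1])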
Next, we describe how we use the ridge penalty estimates to obtain elastic net penalties for $\alpha \in [0, 1]$; below, we omit the subscript $\cdot_{\text{pen}}$ for notational convenience. Denote the ridge penalty parameters and penalty matrix by $\lambda_R$ and $\Lambda_R$ respectively. Define the variance function of the elastic net prior distribution for a given $\alpha \in [0, 1]$ as $h(\cdot)$:

$$h(\lambda_{g_k}) := \operatorname{Var}_{\beta \mid \alpha, \lambda}(\beta_k). \qquad (10)$$

Note that for each $\lambda_{g_k}$, $g_k \in \{1, .., G\}$, there exist $\phi, \lambda_R$ such that $h(\lambda_{g_k}) = \phi\lambda_{R,g_k}^{-1}$.

The prior distribution for $\eta = X\beta$ is not analytical for $\alpha > 0$. An important observation is that we can exploit the high-dimensionality of the data to approximate the prior for $\eta$ with a multivariate normal distribution. The following theorem follows from the multivariate central limit theorem for linear random vector forms in (Eicker, 1966):
Theorem 1. Suppose $\beta_j \overset{\text{ind.}}{\sim} \pi_{g_j}(\beta_j)$ for $j = 1, \dots, p$ and group-specific prior $\pi_{g_j}(\cdot)$, with $E(\beta_j) = 0$ and $\operatorname{Var}(\beta_j) = \tau_{g_j}^2 \in (0, \infty)$. For $g = 1, \dots, G$, let $G_g$ be the group size and let $X_g \in \mathbb{R}^{n \times G_g}$ be the weights corresponding to group $g$. Let $x_{*j}$ denote the $j$th column of $X \in \mathbb{R}^{n \times p}$. Suppose $x_{*j} \neq \mathbf{0} \in \mathbb{R}^n$ for all $j$, $\operatorname{rank}(X) = n$ for all $p$, and, for $p \to \infty$,

$$\max_{j=1,\dots,p}\, x_{*j}^T\left(XX^T\right)^{-1}x_{*j} \to 0. \qquad (11)$$

Then, for fixed $G$, fixed $n$, and $p \to \infty$,

$$\left(\sum_g \tau_g^2 X_gX_g^T\right)^{-1/2}X\beta \overset{d}{\to} N(0, I_{n \times n}), \qquad (12)$$

where $I_{n \times n}$ is the $(n \times n)$-dimensional identity matrix and $(\sum_g \tau_g^2 X_gX_g^T)^{-1/2}$ is the inverse of the unique positive definite square root of $\sum_g \tau_g^2 X_gX_g^T$.

If $n = 1$ then condition (11) is equivalent to $\max_{j=1,\dots,p} x_j^2 / \sum_{j=1}^p x_j^2 \to 0$ for $p \to \infty$. Informally, condition (11) can be interpreted as each variable being asymptotically negligible in size compared to the full data set. In practice, this condition will be reasonable for most omics, especially as omics data are often standardised, but counterexamples may exist, such as mutation data with rare mutations.

Hence, the prior on $\eta$ under an elastic net prior on $\beta$ may be approximated by the following multivariate normal distribution:

$$\pi(\eta \mid \alpha, \lambda) \approx N\left(\mathbf{0}, \sum_g h(\lambda_g)X_gX_g^T\right) = N\left(\mathbf{0}, \sum_g \phi\lambda_{R,g}^{-1}X_gX_g^T\right) = \pi(\eta \mid \phi, \lambda_R). \qquad (13)$$

Then the marginal likelihood may be approximated as follows:

$$\pi(Y \mid \phi, \alpha, \lambda) = \int_{\eta} \pi(Y \mid \eta, \phi)\pi(\eta \mid \lambda, \alpha)\,\mathrm{d}\eta \approx \int_{\eta} \pi(Y \mid \eta, \phi)\pi(\eta \mid \lambda_R, \phi)\,\mathrm{d}\eta = \pi(Y \mid \phi, \lambda_R), \qquad (14)$$

where the latter expression is efficiently computed using Equation (6). Hence, the marginal likelihood of group-regularised elastic net models is approximately the same as that of ridge models, but parametrised differently. The partial derivatives for the elastic net parametrisation may then be obtained from the partial derivatives for the ridge parametrisation as given in Equations (8) and (9) by using the chain rule and a change of variables.

Now, one could again use a general purpose optimiser to maximise the marginal likelihood given in Equation (14) for the elastic net parametrisation. We use a more direct approach and transform the marginal likelihood estimates for the ridge parametrisation to the elastic net parametrisation using the known variance function. As the marginal likelihoods of the ridge and elastic net model are approximately equal, the maximal value, obtained in the transformed maximiser, is also approximately equal. So, the elastic net estimates are given by:

$$\hat{\lambda} = h^{-1}\left(\hat{\phi}\hat{\lambda}_R^{-1}\right), \qquad (15)$$

where $h^{-1}(\cdot)$ is applied element-wise. The proposed approach has the advantage that, once the optimal ridge penalties are obtained, the optimal elastic net penalties are quickly obtained for a whole range of possible $\alpha$ values. Figure 1 illustrates this for elastic net penalties.

The variance function of the elastic net prior for a given $\alpha$ can be shown to be given by:

$$h(\lambda_g) = \operatorname{Var}_{\beta \mid \alpha, \lambda_g}(\beta_k) = \lambda_g^{-1}(1-\alpha)^{-1} - \alpha\lambda_g^{-1/2}(1-\alpha)^{-3/2}\frac{\varphi\left(\lambda_g^{1/2}\alpha(1-\alpha)^{-1/2}\right)}{1 - \Phi\left(\lambda_g^{1/2}\alpha(1-\alpha)^{-1/2}\right)} + \alpha^2(1-\alpha)^{-2}, \qquad (16)$$

with $\Phi(\cdot)$ and $\varphi(\cdot)$ the standard normal cumulative distribution function and probability density function respectively.

The expression simplifies for lasso ($\alpha = 1$) to $h(\lambda_g) = 2\lambda_g^{-2}$, which is easily solved analytically. For $0 < \alpha < 1$, we use the root-finding algorithm of the R-function uniroot to find the root $\lambda_g$ of $h(\lambda_g) - \hat{\phi}\hat{\lambda}_{R,g}^{-1} = 0$. This suffices for all but extreme values of $\lambda_g$, for which the evaluation of the variance function is numerically unstable. Therefore we truncate the values to either the ridge or lasso estimate while maintaining the monotonicity for more extreme $\lambda_g$, as illustrated in Figure 1. We expect that this fix will suffice as the precise absolute value has less impact in the extremes.

Figure 1: Transformed elastic net penalty for various $\alpha$ and ridge penalty.
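The transformation step is small enough to sketch in full. The R code below implements the variance function of Equation (16) and inverts it with uniroot, given ridge estimates phi_hat and lambda_R_hat; the search interval stands in for the truncation range described above and is an assumption, as are the function names.

h_en <- function(lam, alpha) {
  # Variance of the elastic net prior in Equation (1), cf. Equation (16)
  if (alpha == 0) return(1 / lam)        # ridge limit
  if (alpha == 1) return(2 / lam^2)      # lasso (Laplace) limit
  z <- sqrt(lam) * alpha / sqrt(1 - alpha)
  1 / (lam * (1 - alpha)) + alpha^2 / (1 - alpha)^2 -
    alpha / (sqrt(lam) * (1 - alpha)^1.5) *
    dnorm(z) / pnorm(z, lower.tail = FALSE)  # inverse Mills ratio term
}

en_from_ridge <- function(lambda_R_hat, phi_hat, alpha,
                          interval = c(1e-5, 1e5)) {
  # Element-wise inverse of h: solve h(lambda_g) = phi_hat / lambda_R_hat[g];
  # h is monotonically decreasing, so the interval brackets the root when
  # the target lies inside its range (outside, truncate as in the text).
  sapply(phi_hat / lambda_R_hat, function(v)
    uniroot(function(lam) h_en(lam, alpha) - v, interval = interval)$root)
}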
So far, we have assumed the groups to be non-overlapping. A simple way to allow for partly overlapping groups is to make artificial, non-overlapping groups, similarly as proposed as the naive implementation of the latent overlapping group lasso (Jacob et al., 2009).

First consider group-regularised ridge models, for which we define $\tau^2 = \phi\lambda_R^{-1}$. We model the prior variance by the average over multiple groups, given in $Z\tau^2$, as used in (van Nee et al., 2020). Let $\bar{X}$ denote the observed data matrix where each column $k$ of the original matrix $X$ is duplicated $I_k$ times, once for each of the $I_k$ groups covariate $k$ belongs to, with group indices given in $\mathcal{I}_k$. Let $\bar{X}'$ denote the matrix where the columns are additionally scaled by $1/\sqrt{I_k}$. Define the extended vector of artificial regression coefficients by $\bar{\beta}$ and the extended vector with scaled artificial regression coefficients by $\bar{\beta}'$. Each column duplicated from the original column $k$ now corresponds to an artificial, independent effect $\beta_{k_g} = \beta'_{k_g}/\sqrt{I_k} \overset{\text{ind.}}{\sim} N(0, \tau_g^2/I_k)$ for $k = 1, .., p$ and $g \in \mathcal{I}_k$. The effect of a covariate $k$ is equal to the sum of the contributions of the groups, $\beta_k = \sum_{g \in \mathcal{I}_k} \beta_{k_g} = (1/\sqrt{I_k})\sum_{g \in \mathcal{I}_k} \beta'_{k_g}$, such that $X\beta = \bar{X}\bar{\beta} = \bar{X}'\bar{\beta}'$. The prior distribution of $\beta'_{k_g}$ is given by $N(0, \tau_g^2) = N(0, \phi\lambda_{R,g}^{-1})$. Hence, we can use our proposed method on the scaled, duplicated $\bar{X}'$ to obtain the estimates for the prior parameters $\phi, \lambda_R$. Finally, the variance estimates are pooled by the co-data matrix $Z$ to compute $\hat{\beta}$ with ridge prior variances $Z\phi\lambda_R^{-1}$. For elastic net models, we first transform the ridge prior variances given in $Z\phi\lambda_R^{-1}$ to elastic net penalties as described above.

The dimension of $\bar{X}$ increases fast for largely overlapping groups. In practice, however, it is not necessary to store this matrix, nor will it increase the computation time as fast, as the computations for the marginal likelihood only require the high-dimensional computation of the $G$ $n \times n$-dimensional matrices $\bar{X}'_g\bar{X}'^T_g$ once. Note that one should take care in including highly overlapping groups as this results in highly correlated groups.
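A minimal R sketch of this expansion is given below, assuming `memberships` lists, per covariate, the indices of the groups it belongs to; names are illustrative. As noted above, in practice only the $G$ matrices $\bar{X}'_g\bar{X}'^T_g$ are needed, so the expanded matrix never has to be stored.

expand_overlapping <- function(X, memberships) {
  # Build X-bar-prime: duplicate column k once per group membership,
  # scaled by 1/sqrt(I_k), and record the artificial group labels.
  cols <- list(); grp <- integer(0)
  for (k in seq_len(ncol(X))) {
    Ik <- length(memberships[[k]])
    for (g in memberships[[k]]) {
      cols[[length(cols) + 1]] <- X[, k] / sqrt(Ik)
      grp <- c(grp, g)
    }
  }
  list(Xprime = do.call(cbind, cols), groups = grp)
}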
We conduct a model-based linear regression simulation study and illustratethe method on a cancer genomics example in the logistic regression setting. We include thefollowing elastic net models to compare performance and computation time:i) EN ( glmnet , Friedman et al. (2010)): a co-data agnostic elastic net penalty, with globalpenalty parameter obtained by cross-validation;ii) fwEN and fwEN (continuous) ( fwelnet , Tay et al. (2020)): a globally adaptive elasticnet penalty on group level for grouped data ( fwEN ) or elastic net penalty with weightsa function of continuous co-data ( fwEN (continuous) ). Note that we include fwelnet(continuous) only in the cancer genomics data example for which continuous co-datais available;iii) ipf and ipf2 ( ipflasso , Boulesteix et al. (2017)): a group-adaptive elastic net penalty,with the group penalty factors selected from a grid of possible values. We take the gridwhere each penalty factor is in { , } ( ipf ) or in { , , } ( ipf2 ). Note that we onlyinclude the computationally expensive ipf2 in the model-based simulated data examplefor comparison of computing times;iv) gren ( gren , M¨unch et al. (2019)): a group-adaptive elastic net penalty, with the grouppenalty factors obtained by an approximate empirical-variational Bayes framework. Notethat gren is not included in the first linear regression example, as it is not implementedfor the linear model;v) ecpcEN squeezy ( ecpc , van Nee et al. (2020) and squeezy ): a group-adaptive elasticnet penalty. The method ecpc provides empirical Bayes moment estimates for group-adaptive ridge penalties. We use squeezy to transform these to elastic net penaltiescombined with a recalibrated global rescaling penalty;vi) squeezy (single) and squeezy (single+reCV) : a co-data agnostic elastic net penalty,where the global penalty is obtained by transforming the marginal likelihood estimate forthe ridge penalty to an elastic net penalty, without or with recalibration of a global rescal-ing penalty. While glmnet internally standardises linear response, squeezy (single+reCV) does not;vii) squeezy (multi) and squeezy (multi+reCV) : a group-adaptive elastic net penalty us-ing the proposed method, without or with recalibration of a global rescaling penalty. We simulate training and test sets of observed data X in 10 correlated blocks of size p/ α = 1) with mean zero and group-specificscale parameter b g k , g k ∈ { , .., } , for 5 equally sized groups, and response from a normaldistribution: x i,b ind. ∼ N (cid:18) , I ( p/ × ( p/ + ρ (1 − ρ ) ( p/ × ( p/ (cid:19) , i = 1 , .., n, b = 1 , .., ,β k ind. ∼ Laplace (0 , b k g ) , k = 1 , .., p, Y ∼ N ( X β , σ I n × n ) , with ρ = 0 . σ = 2. We consider three settings of group parameters: i) no groups:the group scale parameters b are the same and chosen such that the variance under eachgroup prior is equal to (0 . , . , . , . , . · p ; ii) weakly informative groups: the groupscale parameters are different, but less so than in the third setting. The variances match(0 . , . , . , . , . · p ; iii) informative groups: groups are different, with vari-ances matching (0 . , . , . , . , . · p . 7 Alpha M SE reCV TRUEFALSE
Figure 2: Model-based simulation study, informative groups setting. MSE performance of squeezy (multi) and squeezy (multi+reCV) on the test sets of the pairs of training and test sets, for evenly spaced values of $\alpha \in [0, 1]$, shown in pointwise quantiles.
Figure 3: Model-based simulation study. Box plots of the MSE performance on the test sets of the pairs of training and test sets, for the three settings (no groups, weakly informative groups, informative groups). Methods: 1. EN; 2. fwEN; 3. ipf; 4. ecpcEN squeezy; 5. squeezy (single); 6. squeezy (multi); 7. squeezy (single+reCV); 8. squeezy (multi+reCV).

First, we fit the methods listed above for a fixed $\alpha$, with $n = 150$, $p = 600$ and the co-data providing group membership of the $G = 5$ groups, to compare performance. Our method squeezy easily obtains optimal elastic net group penalties for a range of $\alpha$. We observe that there is usually little benefit from recalibration by cross-validation (reCV) in the linear case (Figure 2).

Figure 3 shows the MSE performance on the test sets for all methods. Our proposed method performs well. The method squeezy (single) (either with or without reCV) outperforms the other methods in the single group setting. The difference in performance between squeezy (single+reCV) and EN may be due to loss of information caused by the internal standardisation of the linear response in EN. More importantly, squeezy (multi) (either with or without reCV) outperforms the other methods, including squeezy (single), in the setting with informative groups, illustrating the benefit of informative co-data. The performance of squeezy (multi) and squeezy (single) is more alike in the setting with weakly informative groups, and superior to the other methods. The method performs better when maximum marginal likelihood estimates for ridge penalties are transformed (squeezy) than when moment estimates are transformed (ecpcEN squeezy). Note that the performance of ipf might improve by using a larger grid, but only at a very substantial computational cost, as shown in Figure 4. While the difference in performance between squeezy and the other methods is smaller for the larger $\alpha$ considered in Appendix B, squeezy (multi) still outperforms the other methods when the co-data is informative (Figure B1 in Appendix B).
Figure 4: Model-based simulation study. Average computation time in training sets for varying $n$, $p$ and $G$. Methods: 1. EN; 2. fwEN; 3. ipf; 4. ipf2; 5. ecpcEN squeezy; 6. squeezy (single); 7. squeezy (multi); 8. squeezy (single+reCV); 9. squeezy (multi+reCV).
Figure 5: Results of 10-fold cross-validation in the miRNA data example. Left: AUC performance for squeezy (multi) and squeezy (multi+reCV) for evenly spaced values of $\alpha \in [0, 1]$. Right: AUC performance for several values of $\alpha$ for various methods. Methods: 1. EN; 2. fwEN; 3. fwEN (continuous); 4. ipf; 5. gren; 6. ecpcEN squeezy; 7. squeezy (single); 8. squeezy (multi); 9. squeezy (single+reCV); 10. squeezy (multi+reCV).

Then, we fit the methods on 5 training sets to compare computation time for varying $n$, $p$ and $G$. The group parameters are set according to the setting with informative groups. We set $(n, p, G) = (150, 600, 5)$ and vary each of $n$, $p$ and $G$ while keeping the others fixed. Figure 4 shows the average computation time for fitting the models. Unsurprisingly, the group-adaptive methods are slower than the other methods, but squeezy scales substantially better than ipf in the number of groups.

We apply squeezy to predict therapy response (clinical benefit versus progressed disease) using microRNA (miRNA) data from a study on colorectal cancer (Neerincx et al., 2018), consisting of $p = 2114$ miRNA expression levels for $n = 88$ independent individuals. As co-data, we use false discovery rates (FDRs) for differential expression in the primary tumour versus adjacent colorectal tissue. The FDRs were obtained in a previous study (Neerincx et al., 2015) from a different set of non-overlapping samples. These co-data were previously shown to be informative for the prediction: miRNAs that are tumour-specific tend to be more important for the response prediction (van Nee et al., 2020). Here, we discretise the continuous FDRs in $G = 8$ equally sized groups, such that we have around 10 samples per group parameter.

We fit the methods listed above on 10 folds to compare cross-validated performance (Figure 5) and average computation time (Table 1). Figure 5 (left) clearly shows the benefit of recalibration by cross-validation (reCV) in this logistic regression setting. Our method squeezy (multi+reCV) performs as well as gren (Figure 5, right, cf. 10 vs 5) but is around 35 times as fast. The method ecpcEN squeezy is twice as fast as squeezy (multi+reCV) and is competitive to squeezy (multi+reCV) and gren. These three methods all benefit from the informative co-data and outperform the other methods.
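The recalibration step used here is straightforward to express with glmnet: the estimated group penalty factors are kept fixed and only the global penalty level is re-tuned by cross-validation. A sketch, with illustrative names:

library(glmnet)

recalibrate_reCV <- function(X, y, groups, lambda_hat, alpha) {
  pf <- lambda_hat[groups]  # fixed relative group penalties
  cv <- cv.glmnet(X, y, family = "binomial", alpha = alpha,
                  penalty.factor = pf, type.measure = "auc", nfolds = 10)
  glmnet(X, y, family = "binomial", alpha = alpha,
         penalty.factor = pf, lambda = cv$lambda.min)
}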
Table 1: Results of 10-fold CV in the miRNA data example. Mean time per fold in seconds and standard deviation over folds, sorted from shortest to longest time.

Method                     Time (s)        Method                     Time (s)
1. EN                      1.17 (1.06)     2. fwEN (groups)           12.20 (6.84)
7. squeezy (single)        5.12 (3.29)     10. squeezy (multi+reCV)   13.58 (13.59)
6. ecpcEN squeezy          5.48 (3.01)     3. fwEN (continuous)       14.11 (11.13)
9. squeezy (single+reCV)   7.45 (5.77)     5. gren                    351.52 (96.74)
8. squeezy (multi)         10.47 (10.50)   4. ipf                     422.39 (259.80)
4 Discussion

The proposed method, termed squeezy, computes fast marginal likelihood estimates for group-adaptive elastic net penalties. The method estimates those penalties by deducing them from group-adaptive ridge penalties, for which we derive a low-dimensional representation of the Taylor approximation of the marginal likelihood and its first derivative for generalised linear models, using results from (Wood, 2011).
Squeezy implements this and uses Nelder-Mead optimisation to find the penalties. An extension would be to also use the Hessian (Wood, 2011), which may speed up the optimisation.

An alternative implementation of our method uses the R-package mgcv (Wood, 2011) to estimate the ridge penalties from the linear predictors, as detailed in (van de Wiel et al., 2020). As mgcv does not allow for more variables than samples, this solution cannot include unpenalised variables. A fix to this is to include these as pre-estimated offsets. In the absence of unpenalised variables, we found that the two implementations provided similar results at comparable computing times. The alternative solution opens up application of squeezy to a wider variety of models (provided that the elastic net counterpart is also available), such as the penalised Cox model for survival data.

We showed that the marginal likelihood of ridge models also approximates the marginal likelihood of elastic net models, by using the asymptotic multivariate normality of the linear predictor. This result also holds for other priors with finite variance. When the co-data includes many groups, for example, a hierarchical prior could be used for shrinkage or selection on the group level. The result does not hold for priors with infinite variance, such as the highly sparse horseshoe prior (Carvalho et al., 2009). Moreover, the practical use of our method for other relatively sparse priors should be tested.

The method transforms the ridge penalties to elastic net penalties using the variance function of the elastic net prior. We showed in a data example that it is beneficial to recalibrate the transformed elastic net penalties in logistic regression. Furthermore, the simulation study and data example showed that our method is more scalable and faster than other group-adaptive methods, while it benefits from co-data and performs as well as or better than other (group-adaptive) methods.

Should squeezy perform inferior to other methods, it may be useful to check whether this might be due to invalidity of the multivariate normal assumption for $\eta = X\beta$. Once the elastic net prior(s) for $\beta$ are known, it is straightforward to generate multiple $\eta$ realisations. Then, we suggest using those to generate a chi-square-based Q-Q plot, e.g. by the R-package MVN (Korkmaz et al., 2014), to verify multivariate normality.

As our method is based on (marginal) likelihood, it in principle facilitates model selection in terms of the hyper-parameters (e.g. single group versus multi-group) by the use of information criteria (Greven and Kneib, 2010). To what extent which criterion is useful in our setting is a topic for further research.

We provide the R-package squeezy and scripts demonstrating the package and reproducing the results on https://github.com/Mirrelijn/squeezy.

Acknowledgements

The first author is supported by ZonMw TOP grant COMPUTE CANCER (40-00812-98-16012). The authors would like to thank Lodewyk Wessels and Soufiane Mourragui (Netherlands Cancer Institute) for the many worthwhile discussions.
References
Anne-Laure Boulesteix, Riccardo De Bin, Xiaoyu Jiang, and Mathias Fuchs. IPF-LASSO: Integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Computational and Mathematical Methods in Medicine, 2017, 2017.

Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80. AISTATS, 2009.

F Eicker. A multivariate central limit theorem for random linear vector forms. The Annals of Mathematical Statistics, pages 1825–1828, 1966.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

Sonja Greven and Thomas Kneib. On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika, 97(4):773–789, 2010.

Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM, 2009.

Selcuk Korkmaz, Dincer Goksuluk, and Gokmen Zararsiz. MVN: An R package for assessing multivariate normality. The R Journal, 6(2):151–162, 2014.

L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.

Magnus M Münch, Carel FW Peeters, Aad W van der Vaart, and Mark A van de Wiel. Adaptive group-regularized logistic elastic net regression. Biostatistics, 2019. doi: 10.1093/biostatistics/kxz062.

Maarten Neerincx, DLS Sie, MA van de Wiel, NCT van Grieken, JD Burggraaf, H Dekker, PP Eijk, Bauke Ylstra, C Verhoef, GA Meijer, et al. MiR expression profiles of paired primary colorectal cancer and metastases by next-generation sequencing. Oncogenesis, 4(10):e170, 2015.

Maarten Neerincx, Dennis Poel, Daoud LS Sie, Nicole CT van Grieken, Ram C Shankaraiah, Floor SW Van Der Wolf-De, Jan-Hein TM van Waesberghe, Jan-Dirk Burggraaf, Paul P Eijk, Cornelis Verhoef, et al. Combination of a six microRNA expression profile with four clinicopathological factors for response prediction of systemic treatment in patients with advanced colorectal cancer. PLoS ONE, 13(8):e0201809, 2018.

J Kenneth Tay, Nima Aghaeepour, Trevor Hastie, and Robert Tibshirani. Feature-weighted elastic net: using "features of features" for better prediction. arXiv preprint arXiv:2006.01395, 2020.

Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology, 19(1A):A68, 2015.

Mark A van de Wiel, Dennis E te Beest, and Magnus M Münch. Learning from a lot: Empirical Bayes for high-dimensional model-based prediction. Scandinavian Journal of Statistics, 46(1):2–25, 2019.

Mark A van de Wiel, Mirrelijn M van Nee, and Armin Rauschenberger. Fast cross-validation for multi-penalty ridge regression. arXiv preprint arXiv:2005.09301, 2020.

Mirrelijn M van Nee, Lodewyk FA Wessels, and Mark A van de Wiel. Flexible co-data learning for high-dimensional prediction. arXiv preprint arXiv:2005.04010, 2020.

Jurre R Veerman, Gwenaël GR Leday, and Mark A van de Wiel. Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models. Communications in Statistics - Simulation and Computation, pages 1–19, 2019.

Simon N Wood. Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):495–518, 2008.

Simon N Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1):3–36, 2011.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

Appendix

A Details for Laplace approximate marginal likelihood for group-adaptive ridge models

We provide details of the derivation of a low-dimensional representation of the Laplace approximation for the marginal likelihood and its first derivative as stated in Equations (6) and (8). When some covariates are left unpenalised, we need to integrate over the penalised covariates only. Wood (2011) forms an orthogonal basis $U$ for the range space of $S$. For group-adaptive ridge penalties, this orthogonal basis is simply given by the identity matrix, as $S$ is diagonal with zeros for unpenalised covariates. The proposed reparametrisation of $X$ and $S$ boils down to simply taking the columns in $X$ that correspond to the penalised variables, $\bar{X} = X_{\text{pen}}$, and taking the columns and rows of the penalty matrix that correspond to the penalised variables, $\bar{S} = S_{\text{pen}} = \Lambda_{\text{pen}} = \sum_g \lambda_g S_{g,\text{pen}}$.

A.1 Laplace approximate marginal likelihood
The high-dimensional log-marginal likelihood approximation is given in Equation (5) in (Wood, 2011) and may be written as:

$$-\ell(\lambda) \approx \frac{D_p}{2\phi} - \ell_s(\phi) + \frac{1}{2}\log(|\bar{X}^TW\bar{X} + \bar{S}|) - \frac{1}{2}\log(|S|_+) = -\ell(\hat{\beta}) + \frac{1}{2\phi}\hat{\beta}^T\Lambda_{\text{pen}}\hat{\beta} + \frac{1}{2}\log(|X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}}|) - \frac{1}{2}\log(|\Lambda_{\text{pen}}|), \qquad (A1)$$

in the notation of (Wood, 2011), with $D_p$ the penalised deviance, $\ell_s$ the saturated log-likelihood and $|S|_+$ the product of the positive eigenvalues of $S$.

When we use Newton iterations to obtain $\hat{\beta}$ or $\hat{\eta}$, upon convergence:

$$\hat{\beta} = (X^TWX + \Lambda)^{-1}X^T(y - \mu + WX\hat{\beta}), \qquad \hat{\eta} = X(X^TWX + \Lambda)^{-1}X^T(y - \mu + W\hat{\eta}).$$

We substitute this in the second term of the log-marginal likelihood approximation:

$$\begin{aligned}
\hat{\beta}^T\Lambda\hat{\beta} &= (y - \mu + WX\hat{\beta})^TX(X^TWX + \Lambda)^{-1}\Lambda\hat{\beta} \\
&= (y - \mu + WX\hat{\beta})^TX\left(I_{p \times p} - (X^TWX + \Lambda)^{-1}X^TWX\right)\hat{\beta} \\
&= (y - \mu + WX\hat{\beta})^T\hat{\eta} - \hat{\eta}^TW\hat{\eta} = (y - \mu)^T\hat{\eta}.
\end{aligned}$$

The latter two terms may be rewritten in a lower-dimensional representation by:

$$\begin{aligned}
\frac{1}{2}\log(|X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}}|) - \frac{1}{2}\log(|\Lambda_{\text{pen}}|) &= -\frac{1}{2}\log(|(X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}\Lambda_{\text{pen}}|) \\
&= -\frac{1}{2}\log(|I_{p_1 \times p_1} - (X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}X_{\text{pen}}^TWX_{\text{pen}}|) \\
&= -\frac{1}{2}\log(|I_{n \times n} - WX_{\text{pen}}(X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}X_{\text{pen}}^T|) \\
&= -\frac{1}{2}\log(|I_{n \times n} - WH_{\text{pen}}|),
\end{aligned}$$

where efficient representations for $H_{\text{pen}}$ exist (van de Wiel et al., 2020). So, the low-dimensional Laplace approximation for the minus log-marginal likelihood is:

$$-\ell(\lambda) \approx -\ell(\hat{\eta}) + \frac{1}{2\phi}(y - \mu)^T\hat{\eta} - \frac{1}{2}\log(|I_{n \times n} - WH_{\text{pen}}|).$$
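The determinant identity above is easy to verify numerically. A quick R sanity check with random data (all values illustrative):

set.seed(1)
n <- 10; p <- 50
X <- matrix(rnorm(n * p), n, p)
W <- diag(runif(n, 0.1, 1))              # IWLS-type weight matrix
lam <- rep(c(1, 5), each = p / 2)        # two penalty groups
A <- crossprod(X, W) %*% X + diag(lam)   # X^T W X + Lambda (p x p)
H <- X %*% solve(A, t(X))                # hat matrix (n x n)
lhs <- 0.5 * as.numeric(determinant(A)$modulus) - 0.5 * sum(log(lam))
rhs <- -0.5 * as.numeric(determinant(diag(n) - W %*% H)$modulus)
all.equal(lhs, rhs)                      # TRUE up to numerical error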
A.2 First derivative of Laplace approximate marginal likelihood

We derive low-dimensional representations of the derivatives to $\rho_g = \log(\lambda_g)$ for $g = 1, .., G$, as derived in a high-dimensional representation by (Wood, 2011).

A.2.1 Derivative $\partial\hat{\eta}/\partial\rho_g$

Denote by $I_g \in \mathbb{R}^{p \times p}$ the diagonal matrix with diagonal element $k$ equal to 1 if $k$ is in group $g$ and 0 otherwise, and define $\Lambda_g = \lambda_g I_g$. The derivative for $\hat{\beta}$ is then given by (Wood, 2011):

$$\begin{aligned}
\frac{\partial\hat{\beta}}{\partial\rho_g} &= -\exp(\rho_g)(X^TWX + S)^{-1}S_g\hat{\beta} = -\exp(\rho_g)(X^TWX + \Lambda)^{-1}\frac{\Lambda_g}{\lambda_g}\hat{\beta} = -\exp(\rho_g)(X^TWX + \Lambda)^{-1}\Lambda\frac{1}{\lambda_g}I_g\hat{\beta} \\
&= -\exp(\rho_g)\left[I_{p \times p} - (X^TWX + \Lambda)^{-1}X^TWX\right]\frac{1}{\lambda_g}I_g\hat{\beta} \\
&= -\exp(\rho_g)\left[I_{p \times p} - (X^TWX + \Lambda)^{-1}X^TWX\right]\frac{1}{\lambda_g}I_g(X^TWX + \Lambda)^{-1}X^T(y - \mu + W\hat{\eta}).
\end{aligned}$$

We can write the derivative for $\hat{\eta}$ as:

$$\begin{aligned}
\frac{\partial\hat{\eta}}{\partial\rho_g} = X\frac{\partial\hat{\beta}}{\partial\rho_g} &= -\exp(\rho_g)X\left[I_{p \times p} - (X^TWX + \Lambda)^{-1}X^TWX\right]\frac{1}{\lambda_g}I_g(X^TWX + \Lambda)^{-1}X^T(y - \mu + W\hat{\eta}) \\
&= -\exp(\rho_g)\left[I_{n \times n} - X(X^TWX + \Lambda)^{-1}X^TW\right]X\frac{1}{\lambda_g}I_g(X^TWX + \Lambda)^{-1}X^T(y - \mu + W\hat{\eta}) \\
&= -\exp(\rho_g)\left[I_{n \times n} - HW\right]\frac{1}{\lambda_g}H_g^T(y - \mu + W\hat{\eta}),
\end{aligned}$$

with $H_g$, which can be seen as the contribution of the $g$th group to the hat matrix, defined as:

$$H_g := X(X^TWX + \Lambda)^{-1}I_gX^T = W^{-1}PW\left(I_{n \times n} - X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^TWPW^{-1}\left[W^{-1} + X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^TWPW^{-1}\right]^{-1}\right)\lambda_g^{-1}X_gX_g^T,$$

where we have used (van de Wiel et al., 2020):

$$P = I_{n \times n} - WX_{\text{unpen}}(X_{\text{unpen}}^TWX_{\text{unpen}})^{-1}X_{\text{unpen}}^T,$$
$$[X(X^TWX + \Lambda)^{-1}]_{\text{pen}} = W^{-1}PW\left(X_{\text{pen}}\Lambda_{\text{pen}}^{-1} - X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^TWPW^{-1}\left[W^{-1} + X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^TWPW^{-1}\right]^{-1}X_{\text{pen}}\Lambda_{\text{pen}}^{-1}\right).$$

A.2.2 Derivative $\partial\ell(\hat{\eta})/\partial\rho_g$

Wood (2008) derives partials for the deviance of generalised linear models:

$$\frac{\partial\ell(\hat{\beta})}{\partial\hat{\beta}} = \frac{1}{\phi}X^T\frac{y - \mu}{V(\mu) \odot g'(\mu)},$$

where the division by $V(\mu) \odot g'(\mu)$ is element-wise, with $\odot$ representing element-wise multiplication. The derivative of the log likelihood is then given by:

$$\begin{aligned}
\frac{\partial\ell(\hat{\eta})}{\partial\rho_g} = \frac{\partial\ell(\hat{\beta})}{\partial\rho_g} = \frac{\partial\ell(\hat{\beta})}{\partial\hat{\beta}^T}\frac{\partial\hat{\beta}}{\partial\rho_g} &= -\exp(\rho_g)\frac{1}{\phi}\left(\frac{y - \mu}{V(\mu)}\right)^TG'^{-1}X(X^TWX + \Lambda)^{-1}\frac{\Lambda_g}{\lambda_g}\hat{\beta} \\
&= -\exp(\rho_g)\frac{1}{\phi}\left(\frac{y - \mu}{V(\mu)}\right)^TG'^{-1}\left[I_{n \times n} - HW\right]\frac{1}{\lambda_g}H_g^T(y - \mu + W\hat{\eta}),
\end{aligned}$$

with $G'$ the diagonal matrix with diagonal elements $g'(\mu_i)$.

A.2.3 Derivative of $\frac{1}{2\phi}(y - \mu)^T\hat{\eta}$

We have for the low-dimensional penalty term, recalling that $\mu = g^{-1}(\hat{\eta})$:

$$\begin{aligned}
\frac{\partial}{\partial\rho_g}\left(\frac{1}{2\phi}(y - \mu)^T\hat{\eta}\right) &= \frac{1}{2\phi}\left(\frac{\partial(y - \mu)^T}{\partial\rho_g}\hat{\eta} + (y - \mu)^T\frac{\partial\hat{\eta}}{\partial\rho_g}\right) = \frac{1}{2\phi}\left(-\frac{\partial\hat{\eta}^T}{\partial\rho_g}G'^{-1}\hat{\eta} + (y - \mu)^T\frac{\partial\hat{\eta}}{\partial\rho_g}\right) \\
&= \frac{1}{2\phi}\left(-G'^{-1}\hat{\eta} + y - \mu\right)^T\frac{\partial\hat{\eta}}{\partial\rho_g} \\
&= -\exp(\rho_g)\frac{1}{2\phi}\left(-G'^{-1}\hat{\eta} + y - \mu\right)^T\left[I_{n \times n} - HW\right]\frac{1}{\lambda_g}H_g^T(y - \mu + W\hat{\eta}).
\end{aligned}$$

A.2.4 Derivative of $-\frac{1}{2}\log(|I_{n \times n} - WH_{\text{pen}}|)$

We have for the third term in Equation (A1):

$$\begin{aligned}
\frac{\partial}{\partial\rho_g}\left(\frac{1}{2}\log(|X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}}|)\right) &= \frac{1}{2}\operatorname{tr}\left((X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}\frac{\partial(X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})}{\partial\rho_g}\right) \\
&= \frac{1}{2}\operatorname{tr}\left((X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}X_{\text{pen}}^T\frac{\partial W}{\partial\rho_g}X_{\text{pen}}\right) + \frac{1}{2}\operatorname{tr}\left((X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}\Lambda_g\right) \\
&= \frac{1}{2}\operatorname{tr}\left(H_{\text{pen}}\frac{\partial W}{\partial\rho_g}\right) + \frac{1}{2}\operatorname{tr}(I_g) - \frac{1}{2}\operatorname{tr}\left(\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^T(W^{-1} + X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^T)^{-1}X_{\text{pen}}I_g\right) \\
&= \frac{1}{2}\operatorname{tr}\left(H_{\text{pen}}\frac{\partial W}{\partial\rho_g}\right) + \frac{1}{2}\operatorname{tr}(I_g) - \frac{1}{2}\operatorname{tr}\left(\lambda_g^{-1}X_gX_g^T(W^{-1} + X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^T)^{-1}\right),
\end{aligned}$$

where we have used the Woodbury identity

$$(X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1} = \Lambda_{\text{pen}}^{-1} - \Lambda_{\text{pen}}^{-1}X_{\text{pen}}^T(W^{-1} + X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^T)^{-1}X_{\text{pen}}\Lambda_{\text{pen}}^{-1}$$

and $H_{\text{pen}} = X_{\text{pen}}(X_{\text{pen}}^TWX_{\text{pen}} + \Lambda_{\text{pen}})^{-1}X_{\text{pen}}^T$, with the partials of $W \in \mathbb{R}^{n \times n}$ readily obtained from (Wood, 2011). For the latter term in Equation (A1):

$$\frac{\partial}{\partial\rho_g}\left(-\frac{1}{2}\log(|\Lambda_{\text{pen}}|)\right) = -\frac{1}{2}\operatorname{tr}\left(\Lambda_{\text{pen}}^{-1}\frac{\partial\Lambda_{\text{pen}}}{\partial\rho_g}\right) = -\frac{1}{2}\operatorname{tr}\left(\Lambda_{\text{pen}}^{-1}I_g\frac{\partial\lambda_g}{\partial\rho_g}\right) = -\frac{1}{2}\operatorname{tr}(I_g)\frac{1}{\lambda_g}\lambda_g = -\frac{1}{2}\operatorname{tr}(I_g).$$

So, the derivative of the Laplace approximate minus log marginal likelihood is given by:

$$\begin{aligned}
\frac{\partial(-\ell(\rho))}{\partial\rho_g} =\ & \frac{1}{\phi}\left(\frac{y - \mu}{V(\mu)}\right)^TG'^{-1}\left[I_{n \times n} - HW\right]H_g^T(y - \mu + W\hat{\eta}) \\
& - \frac{1}{2\phi}\left(-G'^{-1}\hat{\eta} + y - \mu\right)^T\left[I_{n \times n} - HW\right]H_g^T(y - \mu + W\hat{\eta}) \\
& + \frac{1}{2}\operatorname{tr}\left(H_{\text{pen}}\frac{\partial W}{\partial\rho_g}\right) - \frac{1}{2}\operatorname{tr}\left(\lambda_g^{-1}X_gX_g^T(W^{-1} + X_{\text{pen}}\Lambda_{\text{pen}}^{-1}X_{\text{pen}}^T)^{-1}\right).
\end{aligned}$$
A.3 Details for linear regression

Denote by $\varphi(\cdot\,; \mu, \sigma^2)$ the normal pdf with mean $\mu$ and variance $\sigma^2$. The Laplace approximation is in fact exact for linear regression. We need the following parameters:

$$\phi = \sigma^2, \quad V(\mu) = 1, \quad \ell(\hat{\eta}) = \sum_{i=1}^n \log(\varphi(y_i; \hat{\eta}_i, \sigma^2)), \quad \mu = \hat{\eta}, \quad W = I_{n \times n},$$
$$g(\mu) = \mu, \quad g'(\mu) = 1, \quad G'^{-1} = I_{n \times n}.$$

The partial derivative of the minus log likelihood to the scale parameter $\phi = \sigma^2$ from Equation (9) is given by:

$$\frac{\partial(-\ell(\hat{\eta}, \sigma^2))}{\partial\sigma^2} = \frac{n}{2\sigma^2} - \frac{1}{2\sigma^4}(y - \hat{\eta})^T(y - \hat{\eta}).$$
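This derivative is simple to check by finite differences in R (illustrative values):

set.seed(1)
n <- 20
y <- rnorm(n); eta <- rnorm(n); s2 <- 1.3
negll <- function(s2) -sum(dnorm(y, mean = eta, sd = sqrt(s2), log = TRUE))
analytic <- n / (2 * s2) - sum((y - eta)^2) / (2 * s2^2)
numeric  <- (negll(s2 + 1e-6) - negll(s2 - 1e-6)) / 2e-6
c(analytic, numeric)  # agree up to finite-difference error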
A.4 Details for logistic regression

We need the following parameters:

$$\phi = 1, \quad V(\mu_i) = \mu_i(1 - \mu_i), \quad \ell(\hat{\eta}) = \sum_{i=1}^n y_i\log(\operatorname{expit}(\hat{\eta}_i)) + (1 - y_i)\log(1 - \operatorname{expit}(\hat{\eta}_i)),$$
$$\mu = \operatorname{expit}(\hat{\eta}) = (1 + \exp(-\hat{\eta}))^{-1}, \quad W = \operatorname{diag}(\mu \odot (1 - \mu)), \quad \alpha_i = 1, \quad \frac{\partial w_i}{\partial\eta_i} = -\mu_i(1 - \mu_i)(2\mu_i - 1),$$
$$g(\mu_i) = \operatorname{logit}(\mu_i), \quad g'(\mu_i) = (\mu_i(1 - \mu_i))^{-1}, \quad G'^{-1} = W, \quad g''(\mu_i) = \frac{2\mu_i - 1}{\mu_i^2(1 - \mu_i)^2}.$$
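A compact R sketch of these ingredients: penalised IWLS in the direct $p$-dimensional form (the $n$-dimensional shortcut is in van de Wiel et al. (2020)), followed by evaluation of the approximation in Equation (6) with $\phi = 1$. No convergence check or safeguards against separation are included; this is illustrative only.

expit <- function(x) 1 / (1 + exp(-x))

laplace_logistic <- function(X, y, lam, groups, iter = 50) {
  p <- ncol(X); n <- nrow(X)
  Lambda <- diag(lam[groups], p)
  beta <- rep(0, p)
  for (it in seq_len(iter)) {              # penalised IWLS
    eta <- as.vector(X %*% beta)
    mu <- expit(eta)
    w <- mu * (1 - mu)                     # IWLS weights
    z <- eta + (y - mu) / w                # working response
    beta <- solve(crossprod(X, w * X) + Lambda, crossprod(X, w * z))
  }
  eta <- as.vector(X %*% beta); mu <- expit(eta); w <- mu * (1 - mu)
  ll <- sum(y * log(mu) + (1 - y) * log(1 - mu))
  H <- X %*% solve(crossprod(X, w * X) + Lambda, t(X))
  -ll + 0.5 * sum((y - mu) * eta) -
    0.5 * as.numeric(determinant(diag(n) - w * H)$modulus)  # Equation (6)
}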
B Additional figures to the data examples

Figure B1 shows the MSE performance in the model-based simulation study for a larger value of $\alpha$.
Figure B1: Model-based simulation study. Box plots of the MSE performance on the test sets of the pairs of training and test sets for a larger value of $\alpha$, for the three settings (no groups, weakly informative groups, informative groups). Methods: 1. EN; 2. fwEN; 3. ipf; 4. ecpcEN squeezy; 5. squeezy (single); 6. squeezy (multi); 7. squeezy (single+reCV); 8. squeezy (multi+reCV).