Graphical Elastic Net and Target Matrices: Fast Algorithms and Software for Sparse Precision Matrix Estimation
Solt Kovács, Tobias Ruckstuhl, Helena Obrist, Peter Bühlmann
Seminar for Statistics, ETH Zurich, Switzerland
January 7, 2021
Abstract
We consider the estimation of undirected Gaussian graphical models and inverse covariance matrices in high-dimensional scenarios by penalizing the corresponding precision matrix. While single $L_1$ (Graphical Lasso) and $L_2$ (Graphical Ridge) penalties for the precision matrix have already been studied, we propose the combination of both, yielding an Elastic Net type penalty. We enable additional flexibility by allowing diagonal target matrices for the precision matrix to be included. We generalize existing algorithms for the Graphical Lasso and provide corresponding software with an efficient implementation to facilitate usage for practitioners. Our software borrows computationally favorable parts from a number of existing packages for the Graphical Lasso, leading to an overall fast(er) implementation while at the same time providing much more methodological flexibility.

Keywords:
Gaussian graphical model; GLassoElnetFast R-package; Graphical Lasso; High-correlation; High-dimensional; ROPE; Sparsity.
1 Introduction

The estimation of precision matrices (the inverses of covariance matrices) in high-dimensional settings, where traditional estimators perform poorly or do not even exist (e.g. the inverse of the sample covariance matrix), has developed considerably over the past years. For multivariate Gaussian observations (with mean vector $\mu$ and covariance matrix $\Sigma$), a zero off-diagonal entry of the precision matrix $\Sigma^{-1}$ encodes conditional independence of the two corresponding variables given all the other ones. The resulting conditional independence graph, with variables as nodes and edges for nonzero off-diagonal entries, is called a Gaussian graphical model (GGM, Lauritzen, 1996).

Meinshausen and Bühlmann (2006) proposed a nodewise regression approach (selecting one variable as the response while taking all remaining ones as predictors, and repeating this for each variable) using the Lasso (Tibshirani, 1996). The zero entries in the estimated Lasso regression coefficients serve as an estimate of the graphical model, i.e. the zero pattern, but this does not give a full estimate of the precision matrix. To overcome this problem, Zhou et al. (2011) proposed to use an unpenalized maximum likelihood estimator for the covariance matrix based on the previously estimated graph. Yuan (2010) used the Dantzig selector (Candès and Tao, 2007) for regression instead of the Lasso to get a preliminary estimate, followed by solving a linear program to obtain a symmetric matrix which is close to the preliminary estimate (and even positive definite with high probability).

A different approach for estimating precision matrices was proposed by Yuan and Lin (2007), based on the maximization of an $L_1$-penalized Gaussian log-likelihood over positive definite matrices. The $L_1$-penalty for the precision matrix encourages sparsity, and thus the zero pattern of the estimated precision matrix serves as a direct estimate of the underlying GGM. Early proposals for the (challenging) computation of the estimates were given by Yuan and Lin (2007) and Banerjee et al. (2008), as well as the Graphical Lasso (glasso) of Friedman et al. (2008). The glasso algorithm has been widely used in the past, and thus nowadays the $L_1$-penalized Gaussian log-likelihood precision matrix estimation problem itself is also called the Graphical Lasso (or glasso). Other recent approaches for the computation of the Graphical Lasso estimator include, for example, those of Scheinberg et al. (2010), Hsieh et al. (2013, 2014) and Atchadé et al. (2015). Theoretical properties of the Graphical Lasso estimator were investigated by Yuan and Lin (2007), Rothman et al. (2008), Lam and Fan (2009) and Ravikumar et al. (2011) (up to a small difference of whether to penalize the diagonal elements of the precision matrix).

The Graphical Lasso has been modified and adapted to several more specific scenarios, e.g. in the presence of missing data (Städler and Bühlmann, 2012), change points (Londschien et al., 2020), unknown block structure (Marlin and Murphy, 2009), in joint estimation proposals in the presence of groups sharing some information (Guo et al., 2011; Danaher et al., 2012; Shan and Kim, 2018), with a prior clustering of the variables (Tan et al., 2015), or even for confidence intervals based on a desparsified Graphical Lasso estimator (Jankova and van de Geer, 2015). While $L_1$-penalized estimation for GGMs is very popular, numerous other approaches have been proposed as well, e.g. the
SCAD type penalty (Fan et al., 2009), the CLIME estimator (Cai et al., 2011), the SPACE estimator for partial correlation estimation (Peng et al., 2009), banded estimation (Bickel and Levina, 2008) or shrinkage-type approaches (Ledoit and Wolf, 2004).

Let $X_i \in \mathbb{R}^p$ for $i = 1, \dots, n$ be i.i.d. $p$-dimensional Gaussian random vectors with mean $\mu \in \mathbb{R}^p$ and positive definite covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ with corresponding precision matrix $\Theta = \Sigma^{-1}$. Let $X \in \mathbb{R}^{n \times p}$ be the data matrix comprising the $n$ observations, i.e. $X = (X_1, \dots, X_n)^\top$. Without loss of generality assume that the columns of $X$ are centered, such that the sample covariance matrix is given by $S = X^\top X / n$. Furthermore, for a given matrix $A \in \mathbb{R}^{p \times p}$ let the matrix norms $\|\cdot\|_1$ and $\|\cdot\|_2$ be defined as $\|A\|_1 = \sum_{i,j=1}^p |A_{ij}|$ and $\|A\|_2 = \|A\|_F = \sqrt{\sum_{i,j=1}^p A_{ij}^2} = \sqrt{\operatorname{tr}(A^\top A)}$, and similarly let $\|\cdot\|_1^-$ and $\|\cdot\|_2^-$ be the norms without the diagonal entries, i.e. $\|A\|_1^- = \sum_{i \neq j} |A_{ij}|$ and $\|A\|_2^- = \sqrt{\sum_{i \neq j} A_{ij}^2}$. Let $\operatorname{diag}(A)$ be the diagonal matrix with the diagonal elements of the matrix $A$ as its entries. Finally, denote by $A \succ 0$ that $A$ is positive definite. Consider the following type of estimation problem for the unknown precision matrix $\Theta$ based on the $n$ observations:

$$\hat{\Theta}(\lambda, \alpha, T) = \operatorname*{argmin}_{\Theta \succ 0} \Big\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda \big( \alpha \|\Theta - T\|_1 + \tfrac{1-\alpha}{2} \|\Theta - T\|_2^2 \big) \Big\}, \qquad (1)$$

where $T \in \mathbb{R}^{p \times p}$ is a known (typically diagonal) positive semi-definite target matrix, and $\lambda \geq 0$ and $\alpha \in [0, 1]$ are tuning parameters. One could also penalize only the off-diagonal entries via $\|\Theta - T\|_1^-$ and $\|\Theta - T\|_2^-$, and instead of the scalar $\lambda$ one could further generalize the estimation by allowing entry-wise penalties. For ease of reading, we will focus on the more restrictive formulation in equation (1) throughout the paper, but we enable the option of entry-wise penalties in our software. Note that due to the $L_2$-penalty term, the scaling of the variables matters. Thus, it matters whether the sample covariance or the sample correlation matrix is provided as the input $S$. The latter would be recommended in practice in most scenarios.

The formulation (1) encompasses several previously proposed high-dimensional precision matrix estimators. The most classical one is the Graphical Lasso estimator (Friedman et al., 2008), occurring in the case of $\alpha = 1$ and the target matrix being the zero matrix. The $L_2$-penalty (i.e., $\alpha = 0$) was recently proposed independently by van Wieringen and Peeters (2016) and Kuismin et al. (2017). The latter authors refer to it as the Ridge type operator for precision matrix estimation (rope) with different target matrices, while van Wieringen and Peeters (2016) called it the Alternative Ridge Precision estimator, with Type I for the case of a zero target matrix and Type II for a nonzero positive definite target matrix. For the $L_2$-penalized estimators, closed form solutions exist (even with target matrices), unlike for the Graphical Lasso, where various iterative procedures for the computations have been proposed, as mentioned earlier. The only proposal we are aware of incorporating target matrices for the $L_1$-penalty is that of van Wieringen (2019). He iteratively applies his generalized Ridge estimator (with elementwise differing $\lambda$ values) to approximate the loss function for the $L_1$ case. However, when aiming for reasonably accurate estimates, this approach is computationally not attractive even for only moderately sized problems. Now let us mention the combination of $L_1$ and $L_2$-penalties, i.e., the case $\alpha \in (0, 1)$ and $T = 0$.
Genevera Allen in her 2010 Stanford University PhD thesis (Allen, 2010, Algorithm 2) adapted the glasso algorithm of Friedman et al. (2008), similar to our proposal in section 3.2. Atchadé et al. (2015) proposed stochastic proximal optimization methods to obtain near-optimal (i.e., approximate) solutions for regularized precision matrix estimation, and their approach also incorporates Elastic Net penalties (without target). They focus on large-scale problems, where the computation of exact solutions becomes impractical. Lastly, Rothman (2012) considers a sparse covariance matrix estimator, and his proposed algorithm bears resemblance to our algorithms for Elastic Net type penalties. We are not aware of publicly available software corresponding to these proposals with Elastic Net penalties; moreover, efficient software for target matrices is limited to the Ridge penalty.

We focus on the estimation problem (1). Elastic Net type penalties (for Gaussian log-likelihood based precision matrix estimation) could provide both the advantages of $L_2$-penalized estimation (as presented recently by van Wieringen and Peeters (2016) and Kuismin et al. (2017)) as well as the benefits of $L_1$-penalization, such as sparse precision matrix estimates that are desired for graphical models. Moreover, the inclusion of suitable target matrices could considerably improve estimation results. However, as discussed previously, only special cases have been treated so far in the literature, both in terms of available algorithms as well as corresponding software. Our goal is to contribute to computational approaches for the general problem (1), i.e., to develop new algorithms that allow estimation in more flexible ways, and also to provide software with efficient implementations of these proposals for practitioners. A practice oriented introduction to targets, a motivating real data example and some code snippets showing the usage of the package in section 2 are aimed to facilitate understanding and usage.

Our algorithms build upon the glasso algorithm of Friedman et al. (2008) and the more recent dpglasso algorithm of Mazumder and Hastie (2012c) that offers certain advantages over its predecessor. We describe these two algorithms (for the case without target matrices) in section 3.1. We then introduce our modifications in section 3.2 that lead to our gelnet (Graphical Elastic Net) and dpgelnet algorithms. In section 3.3 we further generalize these approaches to include diagonal target matrices. While diagonal matrices are less flexible than arbitrary positive semi-definite target matrices, we note that in practice many fall into this category (e.g. all target matrices that were used in simulations by van Wieringen and Peeters (2016) and by Kuismin et al. (2017)). Our developed algorithms are supported by mathematical derivations and theory. We present simulations in section 4, comparing our new estimators with Elastic Net penalties and target matrices to the plain Graphical Lasso and also to recent Ridge-type approaches. Competitive computational performance is demonstrated in section 5 by benchmarking our software against widely used packages such as the glasso (Friedman et al., 2019), glassoFast (Sustik et al., 2018), and dpglasso (Mazumder and Hastie, 2012a) R-packages in terms of computational times.

2 Targets, a real data example and software usage

2.1 Target types

We present some target types that can be used within equation (1) and which were used in the simulations in section 4.

• True Diagonal: Taking the diagonal of the underlying true precision matrix. Of course, using this target is only possible in simulation settings, where the original precision matrix is known.
• Identity: Taking the identity matrix as target. This choice of target is very conservative (i.e., having rather small entries) when the input $S$ in problem (1) is the empirical correlation matrix.
• $v$-Identity: Multiplying the identity matrix with a scalar $v$, where $v$ is the inverse of the mean of all diagonal entries in the sample covariance matrix (see Kuismin et al., 2017).
• Eigenvalue: The "DAIE" target type from the default.target function of the rags2ridges R-package (Peeters et al., 2020). This takes a diagonal matrix with the average of the inverse nonzero eigenvalues of the sample covariance matrix as entries. Eigenvalues under a certain threshold are set to zero.
• Maximal Single Correlation: For variable $j$, take from the remaining variables the one that has the highest absolute correlation with $j$ and denote it by $k$. The corresponding correlation is denoted as $\rho_{jk}$. Then set $T_{jj} = ((1 - |\rho_{jk}|) \cdot S_{jj})^{-1}$, where $S_{jj}$ is the $j$-th diagonal element of the sample covariance matrix $S$.
• Nodewise Regression (Meinshausen and Bühlmann, 2006): The idea is similar to the maximal single correlation, but we allow for more than one predictor. For a variable $j$, apply 10-fold cross-validation using Lasso regression (using all variables other than the $j$-th one as predictors). Then set the $j$-th diagonal entry of the target matrix as the inverse of the minimal cross-validated error variance. Under suitable conditions, the target matrix diagonals converge to the diagonal values of the true precision matrix $\Theta$.

Some further targets are implemented in the default.target function of the rags2ridges R-package of Peeters et al. (2020). As an illustration, a minimal sketch of the Maximal Single Correlation target computation is given below.
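The following is a minimal R sketch of the Maximal Single Correlation target described above; the function name mscTarget is ours for illustration only (the GLassoElnetFast package provides its own target function):

```r
# Minimal sketch of the Maximal Single Correlation target (illustration only;
# the GLassoElnetFast package provides its own `target` function).
mscTarget <- function(X) {
  S <- cov(X)                      # sample covariance matrix
  R <- cor(X)                      # sample correlation matrix
  p <- ncol(X)
  t.diag <- numeric(p)
  for (j in 1:p) {
    rho.jk <- max(abs(R[j, -j]))   # highest absolute correlation with variable j
    t.diag[j] <- 1 / ((1 - rho.jk) * S[j, j])
  }
  diag(t.diag, p)                  # diagonal target matrix T
}
```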
2.2 A real data example

In order to illustrate differences between the methodologies, we apply glasso and gelnet (with $\alpha = 0.5$) to gene expression data of $p = 39$ isoprenoid genes from $n = 118$ samples of the plant Arabidopsis thaliana. As there is no solid ground truth available, we only briefly show that different estimation approaches lead to different results, and thus the additional flexibility provided by target matrices and Elastic Net type penalties provides additional options for real data applications compared to the glasso. In all examples below, the penalization parameter for each algorithm is chosen such that the resulting precision matrix has only 200 nonzero off-diagonal entries, which leads to 100 edges between the genes. In the first example (Figure 1) we take the sample correlation matrix of the genes and apply both glasso and gelnet (with $\alpha = 0.5$) without target matrix (i.e. $T = 0$).

Figure 1: Original data and no target is used (i.e. $T = 0$). The 100 edges chosen by glasso (on the left) and gelnet with $\alpha = 0.5$ (on the right).

In the second example we adjust for the first principal component and then apply the same procedure to the new sample correlation matrix. Such an adjustment is common to remove potential hidden confounding variables. As shown on the green right plot of Figure 2, in this case fewer edges are in common.

Figure 2: Data adjusted for the first principal component and no target is used. The 100 edges chosen by glasso (on the left) and gelnet with $\alpha = 0.5$ (on the right).

Instead of adjusting for the first principal component, in the third example we use the Maximal Single Correlation target from subsection 2.1. As shown on the green right plot of Figure 3, in this case even fewer edges are in common.

Figure 3: Original data and the Maximal Single Correlation target is used. The 100 edges chosen by glasso (on the left) and gelnet with $\alpha = 0.5$ (on the right).

2.3 R-package GLassoElnetFast
In the following we provide a few examples on how to use our software.
Installing the GLassoElnetFast R-package. This can be done as follows.
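A minimal sketch of the installation, assuming the package is obtained from CRAN once available (the paper states that a CRAN release is planned) or from a development repository; the repository path below is a placeholder, not a confirmed location:

```r
# install.packages("GLassoElnetFast")                  # once the package is on CRAN
# devtools::install_github("<user>/GLassoElnetFast")   # development version (path is a placeholder)
library(GLassoElnetFast)
```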
Estimation with no target matrices. We show here the options for the Graphical Lasso, the Graphical Elastic Net and the Rope method (Ridge penalty).
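A minimal sketch, assuming $S$ is a sample correlation matrix and that the mixing parameter is passed as alpha (argument names other than lambda, Theta and W, which are documented below, are assumptions):

```r
S <- cor(X)   # sample correlation matrix of the data X

# Graphical Lasso: alpha = 1, no target
fit.glasso <- gelnet(S, lambda = 0.1, alpha = 1)

# Graphical Elastic Net: 0 < alpha < 1, no target
fit.gelnet <- gelnet(S, lambda = 0.1, alpha = 0.5)

# Rope (Ridge penalty): alpha = 0, no target
fit.rope <- gelnet(S, lambda = 0.1, alpha = 0)

fit.gelnet$Theta   # estimated precision matrix
```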
Estimation with the identity target matrix. We show here the options for the Graphical Lasso, the Graphical Elastic Net and the Rope method (Ridge penalty).
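A minimal sketch with the identity matrix as target, passed via the documented Target argument:

```r
p <- ncol(S)
T.id <- diag(p)   # identity target matrix

fit.glassoT <- gelnet(S, lambda = 0.1, alpha = 1,   Target = T.id)
fit.gelnetT <- gelnet(S, lambda = 0.1, alpha = 0.5, Target = T.id)
fit.ropeT   <- gelnet(S, lambda = 0.1, alpha = 0,   Target = T.id)
```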
Estimation with the identity target matrix and using cross-validation. We show here the options for the Graphical Elastic Net.
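A hand-rolled cross-validation sketch (we do not assume a built-in CV helper here); the held-out negative Gaussian log-likelihood $-\log\det\hat\Theta + \operatorname{tr}(S_{\text{test}}\hat\Theta)$ serves as the score:

```r
# 5-fold cross-validation over a lambda grid for the Graphical Elastic Net
# with identity target (illustration only).
cv.gelnet <- function(X, lambdas, alpha = 0.5, Target = diag(ncol(X)), K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(X)))
  score <- matrix(NA, K, length(lambdas))
  for (k in 1:K) {
    S.train <- cor(X[folds != k, ])
    S.test  <- cor(X[folds == k, ])
    for (l in seq_along(lambdas)) {
      Theta <- gelnet(S.train, lambda = lambdas[l], alpha = alpha,
                      Target = Target)$Theta
      # held-out negative log-likelihood: -log det(Theta) + tr(S.test Theta)
      score[k, l] <- -determinant(Theta)$modulus + sum(S.test * Theta)
    }
  }
  lambdas[which.min(colMeans(score))]   # lambda with smallest CV score
}
```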
Further remarks about the gelnet and dpgelnet functions. They include the following:
• Instead of a scalar $\lambda$, one can also provide entry-wise penalties via the argument lambda (which needs to be a vector or a matrix in this case).
• Besides the estimated precision matrix (Theta), an estimate of the covariance matrix (W) is also returned. This can be accessed by e.g. fitGelnet$W.
• One can choose whether to penalize the diagonal via the penalize.diagonal argument.
• One can provide warm starts (via the arguments Theta and W).
• niter, del and conv are also part of the output, yielding information about the number of iterations, the change in parameter value and convergence (TRUE or FALSE) with the set number of iterations and thresholds. One can set various thresholds (arguments outer.thr and inner.thr) and iteration numbers (arguments outer.maxit and inner.maxit) when aiming for more accurate results or in case algorithms face convergence issues.

Other notable arguments for the gelnet function:
• The argument zero allows to specify indices of the precision matrix which are constrained to be zero.
• The argument Target allows to specify a diagonal target matrix. Target matrices mentioned in this paper are implemented via the function target, and some further ones are implemented in the default.target function of the rags2ridges R-package.

Further explanations can be obtained in the help files, e.g. help(gelnet). A short sketch combining some of these arguments is given below.
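A minimal sketch combining entry-wise penalties and zero constraints; the two-column index format passed to zero is our assumption (the exact format expected is documented in the help files):

```r
p <- ncol(S)
Lambda <- matrix(0.1, p, p)             # entry-wise penalties
Lambda[1, 2] <- Lambda[2, 1] <- 0.5     # penalize one pair more strongly

zero.idx <- rbind(c(1, 3), c(3, 1))     # constrain Theta[1,3] = Theta[3,1] = 0 (format assumed)

fit <- gelnet(S, lambda = Lambda, alpha = 0.5,
              Target = diag(p), zero = zero.idx,
              penalize.diagonal = FALSE)
```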
Our implementations in the GLassoElnetFast R-package build upon the glasso R-package (Friedman et al., 2019) and the dpglasso R-package (Mazumder and Hastie, 2012a) by suitably modifying them. Additionally, we also utilized a faster implementation from the glassoFast R-package (Sustik et al., 2018; Sustik and Calderhead, 2012). Moreover, whenever possible, we even improve on the original versions. For example, to achieve further speed-ups, we incorporate block diagonal screening (Mazumder and Hastie, 2012b; Witten et al., 2011), which was missing in the dpglasso and glassoFast packages. Compared to the dpglasso package, our implementation relies even more on Fortran, which is beneficial for its speed, while for the estimation we also allow the $\|\cdot\|_1^-$ penalty (instead of $\|\cdot\|_1$ only) and entry-wise penalties. Besides such technical improvements on existing software (which lead to some computational speed-up as shown in section 5), the main new features are the additional flexibility of allowing Elastic Net penalties and diagonal target matrices, as described earlier. Our implementation is planned to be made available as an R-package on CRAN.

3 From glasso and dpglasso to gelnet and dpgelnet

3.1 glasso and dpglasso

In this section we briefly present the glasso algorithm by Friedman et al. (2008) as well as the dpglasso algorithm by Mazumder and Hastie (2012c), and set up some notation which we will rely on when presenting in the next subsection 3.2 the modified versions with the Elastic Net penalties, called gelnet and dpgelnet. We closely follow the derivation presented in Mazumder and Hastie (2012c). Both the glasso and the dpglasso algorithm seek to minimize the $L_1$-regularized negative log-likelihood over all positive definite precision matrices $\Theta$:

$$\hat\Theta(\lambda, \alpha = 1, T = 0) = \operatorname*{argmin}_{\Theta \succ 0} \{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\|\Theta\|_1 \}.$$

We call the function inside the argmin the
Graphical Lasso loss, which is a special case of (1). We assume the tuning parameter $\lambda \geq 0$ to be given; $S$, required as the input, is usually the empirical covariance or correlation matrix. The algorithms work with the normal equations corresponding to the Graphical Lasso loss:

$$-\Theta^{-1} + S + \lambda\Gamma = 0, \qquad -W + S + \lambda\Gamma = 0, \qquad (2)$$

where $W := \Theta^{-1}$ is called the working covariance matrix and the matrix $\Gamma$ denotes the component-wise signs of $\Theta$, with $\Gamma_{i,j} \in [-1, 1]$ for $\Theta_{i,j} = 0$. The approach used for solving the normal equations is to update one dimension (row and column) at a time while leaving the remaining ones fixed. The update requires solving a suitable penalized regression problem, which can be performed efficiently by applying coordinate descent (Friedman et al., 2007). After updating, one proceeds with the next dimension (row and column), until all dimensions have been updated once. Typically one repeats these update cycles for all the variables as long as differences are above a certain threshold used as a stopping criterion.

Notation 1.
Partition $S$, $\Theta$, $W$ and $\Gamma$ into block form as follows: a matrix with dimensions $(p-1) \times (p-1)$ (e.g. $S_{11}$), two vectors of length $p-1$ (e.g. $s_{12}$) and a scalar (e.g. $s_{22}$):

$$S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{pmatrix}, \quad \Theta = \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^\top & \theta_{22} \end{pmatrix}, \quad W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{pmatrix}, \quad \Gamma = \begin{pmatrix} \Gamma_{11} & \gamma_{12} \\ \gamma_{12}^\top & \gamma_{22} \end{pmatrix}. \qquad (3)$$

The main difference between glasso and dpglasso lies in the fact that glasso actively works with $W$, while dpglasso works with its inverse $\Theta$. Furthermore, glasso and dpglasso use different block-matrix representations for $W$, where properties of inverses of block-partitioned matrices are used together with $\Theta = W^{-1}$:

$$W_{\text{glasso}} = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{pmatrix} = \begin{pmatrix} (\Theta_{11} - \theta_{12}\theta_{12}^\top/\theta_{22})^{-1} & -W_{11}\theta_{12}/\theta_{22} \\ -\theta_{12}^\top W_{11}/\theta_{22} & 1/\theta_{22} + \theta_{12}^\top W_{11}\theta_{12}/\theta_{22}^2 \end{pmatrix}, \qquad (4)$$

$$W_{\text{dpglasso}} = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{pmatrix} = \begin{pmatrix} \Theta_{11}^{-1} + \dfrac{\Theta_{11}^{-1}\theta_{12}\theta_{12}^\top\Theta_{11}^{-1}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} & \dfrac{-\Theta_{11}^{-1}\theta_{12}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} \\ \dfrac{-\theta_{12}^\top\Theta_{11}^{-1}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} & \dfrac{1}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} \end{pmatrix}. \qquad (5)$$

As the remaining steps of the derivation (e.g. the regression problems involved in the row and column updates as well as the coordinate descent procedure to solve them) are special cases of what is discussed in the next subsection 3.2, we only present the final algorithms here. Note that in Algorithm 1 (and Algorithm 2) $I$ denotes the identity matrix (of dimension $p \times p$).

Algorithm 1 (glasso)
Initialize $W_{\text{init}} = S + \lambda I$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve the following Lasso problem with coordinate descent to get $\hat\beta$:
$$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p-1}} \{ \tfrac{1}{2}\beta^\top W_{11}\beta - \beta^\top s_{12} + \lambda\|\beta\|_1 \}.$$
As warm start for $\beta$ use the solution from the previous round for this row/column.
c: Update the off-diagonal of the working covariance matrix as $\hat w_{12} = W_{11}\hat\beta$ (and similarly for $\hat w_{21}$), but do not change the diagonal entry $w_{22}$.
d: Save $\hat\beta$ for this row/column in a matrix $B$.
Finally, for every row/column, compute the diagonal entries of $\Theta$ using $\hat\theta_{22} = 1/(w_{22} - \hat\beta^\top\hat w_{12})$ and obtain the off-diagonal entries of $\Theta$ from the matrix $B$, where $\hat\theta_{12} = -\hat\theta_{22}\hat\beta$ and $\hat\theta_{21} = \hat\theta_{12}^\top$.

Algorithm 2 (dpglasso)
Initialize $\Theta_{\text{init}} = \operatorname{diag}(S + \lambda I)^{-1}$ and $W_{\text{init}} = S + \lambda I$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve the following quadratic program with coordinate descent for $\gamma \in \mathbb{R}^{p-1}$:
$$\hat\gamma = \operatorname*{argmin}_{\|\gamma\|_\infty \leq \lambda} \{ (s_{12} + \gamma)^\top \Theta_{11} (s_{12} + \gamma) \}.$$
c: Update $\hat\theta_{12} = -\Theta_{11}(s_{12} + \hat\gamma)/w_{22}$ and $\hat\theta_{21} = \hat\theta_{12}^\top$.
d: Update $\hat\theta_{22} = (1 - (s_{12} + \hat\gamma)^\top\hat\theta_{12})/w_{22}$.
e: Update $\hat w_{12} = s_{12} + \hat\gamma$ and $\hat w_{21} = \hat w_{12}^\top$, but do not change the diagonal entry $w_{22}$.

If the primary interest lies in the estimation of $\Theta$, the dpglasso algorithm offers several advantages. It yields both a sparse and positive definite $\Theta$, while glasso only ensures a sparse solution, which is not necessarily positive definite in all cycles of the algorithm and thus, in rare cases, also not when stopping. Moreover, dpglasso converges with all positive definite warm starts $\Theta_{\text{ws}}$. glasso is guaranteed to maintain positive definite updates in each step only if the warm start $W_{\text{ws}}$ fulfills the conditions $W_{\text{ws}} \succ 0$ and $\|W_{\text{ws}} - S\|_\infty \leq \lambda$ (which is the case for example for the standard initialization $W_{\text{init}} = S + \lambda I$). For detailed results and differences see Mazumder and Hastie (2012c).

3.2 gelnet and dpgelnet

In this section the modified versions of the glasso and dpglasso algorithms, named gelnet and dpgelnet, are derived. The specific implementations of gelnet and dpgelnet utilize the shell of the glasso code provided by Friedman et al. (2019) and the dpglasso code provided by Mazumder and Hastie (2012a), see section 2.3. Penalizing the negative log-likelihood with a combination of $L_1$ and $L_2$ terms leads to the following special case of problem (1):

$$\hat\Theta(\lambda, \alpha, T = 0) = \operatorname*{argmin}_{\Theta \succ 0} \Big\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\big(\alpha\|\Theta\|_1 + \tfrac{1-\alpha}{2}\|\Theta\|_2^2\big) \Big\}, \qquad (6)$$

where $\lambda$ is a non-negative tuning parameter, $\alpha \in [0, 1]$ is another tuning parameter, and $S$ is a positive semi-definite matrix (typically the covariance or correlation matrix). Minimizing expression (6) is called the Graphical Elastic Net or, in short, the gelnet problem. The corresponding normal equations are

$$-\Theta^{-1} + S + \lambda\alpha\Gamma + \lambda(1-\alpha)\Theta = 0, \qquad -W + S + \lambda\alpha\Gamma + \lambda(1-\alpha)\Theta = 0, \qquad (7)$$

where $\Gamma$ is the sign matrix of $\Theta$ and $W$ is the working covariance matrix, similar to the previous notation from (2). The gelnet and dpgelnet algorithms follow the block coordinate descent approach of the glasso and dpglasso algorithms in solving the normal equations (7): update one row/column at a time while leaving the remaining ones fixed. The update requires solving another optimization problem (a regression problem), which can be performed efficiently by applying coordinate descent (Friedman et al., 2007). After updating, one proceeds with the next row/column update, until all of them have been updated once. Typically one repeats these update cycles for all the variables as long as differences are above the threshold used as the stopping criterion. Both algorithms work actively with $\Theta$ and $W$ simultaneously, in contrast to glasso and dpglasso. The detailed derivation of gelnet and dpgelnet can be split into two problems:

• Problem 1) Derive the formula for the row/column updates as well as the optimization problem involved therein.
• Problem 2) Apply coordinate descent for the optimization from Problem 1).

Coordinate descent (Tseng, 2001; Friedman et al., 2007) performs componentwise updates leaving the other coordinates fixed. If the function to be minimized is convex and can be decomposed into a differentiable, convex function plus a sum of convex functions of each individual parameter, then iteratively updating each coordinate is guaranteed to converge to the global minimizer. For more details, see Tseng (2001), Friedman et al. (2007) or Hastie et al. (2015). The functions to be minimized at each of the row/column updates of gelnet and dpgelnet fulfill this property, and hence only the componentwise updates have to be derived.

gelnet

Solution to Problem 1)
Consider the $p$-th row of the normal equations in (7) without the diagonal entry, using the notation from (3):

$$-w_{12} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0.$$

Use (4) for $w_{12}$ and define $\beta := -\theta_{12}/\theta_{22}$:

$$W_{11}\theta_{12}/\theta_{22} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0,$$
$$W_{11}\beta - s_{12} - \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{22}\beta = 0. \qquad (8)$$

Note that $\gamma_{12} \in -\operatorname{sign}(\beta)$ since $\theta_{22} > 0$. Therefore (8) corresponds to the normal equation of the following $L_1$- and $L_2$-regularized quadratic program:

$$\operatorname*{minimize}_{\beta \in \mathbb{R}^{p-1}} \Big\{ \tfrac{1}{2}\beta^\top W_{11}\beta - \beta^\top s_{12} + \lambda\alpha\|\beta\|_1 + \lambda\tfrac{1-\alpha}{2}\theta_{22}\,\beta^\top\beta \Big\},$$

or equivalently:

$$\operatorname*{minimize}_{\beta \in \mathbb{R}^{p-1}} \Big\{ \tfrac{1}{2}\big\|W_{11}^{1/2}\beta - W_{11}^{-1/2}s_{12}\big\|_2^2 + \lambda\alpha\|\beta\|_1 + \lambda\tfrac{1-\alpha}{2}\theta_{22}\|\beta\|_2^2 \Big\}. \qquad (9)$$

After solving the quadratic program (see Problem 2) below) with minimizer $\hat\beta$, update the entries as follows:
• $\hat w_{12} = -W_{11}\theta_{12}/\theta_{22} = W_{11}\hat\beta$, using the representation (4) and $\hat\beta = -\theta_{12}/\theta_{22}$
• $\hat\theta_{22} = 1/(w_{22} - \hat\beta^\top\hat w_{12})$, by using (4), $\hat\beta = -\theta_{12}/\theta_{22}$ and $\hat w_{12} = W_{11}\hat\beta$
• $\hat\theta_{12} = -\hat\theta_{22}\hat\beta$
• $\hat w_{22}$ via the normal equations (7)

Solution to Problem 2)
We show here how to apply componentwise updates for solving the $L_1$- and $L_2$-regularized quadratic program (9). Using the notation $Z = W_{11}^{1/2}$, $y = W_{11}^{-1/2}s_{12}$, $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1-\alpha)\theta_{22}$, the quadratic program translates to a standard Elastic Net regression problem, i.e.,

$$R(\beta) := \tfrac{1}{2}\|y - Z\beta\|_2^2 + \lambda_1\|\beta\|_1 + \tfrac{\lambda_2}{2}\|\beta\|_2^2$$

needs to be minimized over $\beta \in \mathbb{R}^{p-1}$. Let $\tilde\beta_k$ for $k \neq j$ be estimates and partially optimize $R(\beta)$ with respect to $\beta_j$ by computing the gradient at $\beta_j = \tilde\beta_j$. The gradient only exists if $\tilde\beta_j \neq 0$. Without loss of generality assume that $\tilde\beta_j > 0$. Then:

$$\frac{\partial R}{\partial \beta_j}\bigg|_{\beta = \tilde\beta} = -\sum_{i=1}^{p-1} Z_{i,j}\,(y_i - Z_i^\top\tilde\beta) + \lambda_1 + \lambda_2\tilde\beta_j.$$

With the partial residuals given as $r_i^{(j)} = y_i - \sum_{k \neq j} Z_{ik}\tilde\beta_k$, setting the partial derivative to 0 yields:

$$\tilde\beta_j = \frac{\sum_{i=1}^{p-1} Z_{i,j}\, r_i^{(j)} - \lambda_1}{\sum_{i=1}^{p-1} Z_{i,j}^2 + \lambda_2}.$$

A similar derivation can be done for $\tilde\beta_j < 0$. The case $\tilde\beta_j = 0$ is treated separately using standard sub-differential calculus. The overall solution for $\tilde\beta_j$ satisfies:

$$\tilde\beta_j = \frac{S_{\lambda_1}\big(\sum_{i=1}^{p-1} Z_{i,j}\, r_i^{(j)}\big)}{\sum_{i=1}^{p-1} Z_{i,j}^2 + \lambda_2}, \qquad (10)$$

where $S_{\lambda_1}(x) = \operatorname{sign}(x)(|x| - \lambda_1)_+$ is the soft-thresholding operator. Using the inner products $Z^\top Z = W_{11}$ and $Z^\top y = s_{12}$, equation (10) can be written as

$$\tilde\beta_j = \frac{S_{\lambda_1}\big((s_{12})_j - \sum_{k \neq j} (W_{11})_{k,j}\tilde\beta_k\big)}{(W_{11})_{j,j} + \lambda_2}.$$

A minimal R sketch of this coordinate update is given below.
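The following is a minimal R sketch of the inner coordinate descent loop, directly transcribing the update formula above (illustration only; the actual package implementation is in Fortran):

```r
# One full sweep of coordinate descent for the Elastic Net subproblem (9).
# W11: (p-1)x(p-1) block, s12: length p-1 vector, beta: current estimates.
soft <- function(x, lam) sign(x) * pmax(abs(x) - lam, 0)   # soft-thresholding

cd.sweep <- function(beta, W11, s12, lambda1, lambda2) {
  for (j in seq_along(beta)) {
    # (s12)_j - sum_{k != j} (W11)_{k,j} * beta_k
    num <- s12[j] - sum(W11[-j, j] * beta[-j])
    beta[j] <- soft(num, lambda1) / (W11[j, j] + lambda2)
  }
  beta
}
```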
Putting all these pieces from Problem 1) and Problem 2) together yields the gelnet algorithm (Algorithm 3 below).

Algorithm 3 (gelnet)
Initialize $\Theta_{\text{init}} = \operatorname{diag}(S + \lambda\alpha I)^{-1}$ and $W_{\text{init}} = S + \lambda\alpha I + \lambda(1-\alpha)\Theta_{\text{init}}$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve the Elastic Net regression problem (9) with coordinate descent to get $\hat\beta$. As warm start for $\beta$ use the solution from the previous round for this row/column.
c: Update $\hat w_{12} = W_{11}\hat\beta$, $\hat w_{21} = \hat w_{12}^\top$.
d: Update $\hat\theta_{22} = 1/(w_{22} - \hat\beta^\top\hat w_{12})$.
e: Update $\hat\theta_{12} = -\hat\theta_{22}\hat\beta$, $\hat\theta_{21} = \hat\theta_{12}^\top$.
f: Update $\hat w_{22} = s_{22} + \lambda\alpha + \lambda(1-\alpha)\hat\theta_{22}$.

dpgelnet

Solution to Problem 1)
Consider the $p$-th row of the normal equations in (7) without the diagonal entry, using the notation from (3):

$$-w_{12} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0.$$

Use (5) for $w_{12}$ and $w_{22}$, and then multiply with $\Theta_{11}$ from the left to obtain

$$\frac{\Theta_{11}^{-1}\theta_{12}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0,$$
$$\Theta_{11}^{-1}w_{22}\,\theta_{12} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0,$$
$$w_{22}\,\theta_{12} + \Theta_{11}\big(s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12}\big) = 0. \qquad (11)$$

Consider the case with $\alpha \in (0, 1]$. Substitute $\tilde\gamma := \lambda\alpha\gamma_{12}$ as well as $\tilde q := \operatorname{abs}(\theta_{12})$ (so that $\theta_{12} = \tilde q * \tilde\gamma/(\lambda\alpha)$) and get the following Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{w_{22}}{\lambda\alpha}\,\tilde q * \tilde\gamma + \Theta_{11}\Big(s_{12} + \big(1 + \tfrac{1-\alpha}{\alpha}\tilde q\big) * \tilde\gamma\Big) = 0, \qquad \tilde q * \big(\operatorname{abs}(\tilde\gamma) - \lambda\alpha\,\mathbf{1}_{p-1}\big) = 0, \qquad \|\tilde\gamma\|_\infty \leq \lambda\alpha, \qquad (12)$$

where $*$ denotes elementwise multiplication. These are equivalent to the following box-constrained quadratic program for $\gamma \in \mathbb{R}^{p-1}$:

$$\operatorname*{minimize}_{\|\gamma\|_\infty \leq \lambda\alpha} \Big\{ \big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma\big)^\top \Theta_{11} \big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma\big) \Big\}. \qquad (13)$$

In the special case where $\alpha = 0$, the normal equations (11) simplify to:

$$w_{22}\,\theta_{12} + \Theta_{11}(s_{12} + \lambda\theta_{12}) = 0.$$

Substitute $\tilde q = \operatorname{abs}(\theta_{12})$ to get the following Karush-Kuhn-Tucker (KKT) conditions:

$$w_{22}\,\tilde q * \gamma + \Theta_{11}(s_{12} + \lambda\,\tilde q * \gamma) = 0, \qquad \tilde q * \big(\operatorname{abs}(\gamma) - \mathbf{1}_{p-1}\big) = 0, \qquad \|\gamma\|_\infty \leq 1.$$

These are equivalent to the following box-constrained quadratic program for $\gamma \in \mathbb{R}^{p-1}$:

$$\operatorname*{minimize}_{\|\gamma\|_\infty \leq 1} (s_{12} + \lambda\,\tilde q * \gamma)^\top \Theta_{11} (s_{12} + \lambda\,\tilde q * \gamma).$$

After solving the quadratic program (see Problem 2) below) with solution $\gamma^*$, update for $\alpha \in (0, 1]$ in the following way (with very similar updates for the case $\alpha = 0$):
• $\hat\theta_{12} = -\Theta_{11}\big(s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)/w_{22}$, using (12) and that $\hat\theta_{12} = \tilde q * \tilde\gamma/(\lambda\alpha)$
• $\hat\theta_{22} = 1/w_{22} + \hat\theta_{12}^\top\Theta_{11}^{-1}\hat\theta_{12}$ by (5), and then use $\hat\theta_{12}$ to get $\hat\theta_{22} = \big(1 - (s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*)^\top\hat\theta_{12}\big)/w_{22}$
• $\hat w_{12}$ and $\hat w_{22}$ via the normal equations (7)

Solution to Problem 2)
We show here how to apply componentwise updates for solving the box-constrained quadratic program (13). Taking the derivative with respect to the $j$-th coordinate of $c$ for a quadratic program of the form $(a + b * c)^\top D (a + b * c)$ leads, up to a factor of 2, to:

$$\tfrac{1}{2}\big[\partial_c\,(a + b*c)^\top D (a + b*c)\big]_{(j)} = (b * (Da))_j + (b * (D(b*c)))_j = (b * (D(a + b*c)))_j = b_j \sum_{k=1}^{p-1} d_{j,k}(a_k + b_k c_k).$$

Setting $b_j \sum_{k=1}^{p-1} d_{j,k}(a_k + b_k c_k) = 0$, one obtains

$$d_{j,j}(a_j + b_j c_j) = -\sum_{k \neq j} d_{j,k}(a_k + b_k c_k), \qquad c_j = \frac{-\sum_{k=1}^{p-1} d_{j,k}(a_k + b_k c_k) + d_{j,j} b_j c_j}{d_{j,j} b_j}.$$

This leads to the following coordinate update of $\gamma_j$ (which is subsequently projected back onto the box, here $[-\lambda\alpha, \lambda\alpha]$, to enforce the constraint):

$$\gamma_{j,\text{new}} = \frac{-\sum_{k=1}^{p-1}(\Theta_{11})_{j,k}\big((s_{12})_k + (1 + \tfrac{1-\alpha}{\alpha}\tilde q)_k\,\gamma_{k,\text{old}}\big) + (\Theta_{11})_{j,j}\,(1 + \tfrac{1-\alpha}{\alpha}\tilde q)_j\,\gamma_{j,\text{old}}}{(\Theta_{11})_{j,j}\,(1 + \tfrac{1-\alpha}{\alpha}\tilde q)_j}.$$

In the special case where $\alpha = 0$,

$$\gamma_{j,\text{new}} = \frac{-\sum_{k=1}^{p-1}(\Theta_{11})_{j,k}\big((s_{12})_k + \lambda\tilde q_k\,\gamma_{k,\text{old}}\big) + (\Theta_{11})_{j,j}\,\lambda\tilde q_j\,\gamma_{j,\text{old}}}{(\Theta_{11})_{j,j}\,\lambda\tilde q_j}.$$

Putting all these pieces from Problems 1) and 2) together yields the dpgelnet algorithm. Algorithm 4 below shows the procedure for $\alpha \in (0, 1]$ (which changes only slightly for the case $\alpha = 0$). A minimal sketch of the box-constrained coordinate update follows the algorithm.

Algorithm 4 (dpgelnet)
Initialize $\Theta_{\text{init}} = \operatorname{diag}(\lambda\alpha I + S)^{-1}$ and $W_{\text{init}} = S + \lambda\alpha I + \lambda(1-\alpha)\Theta_{\text{init}}$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve (13) and denote the solution by $\gamma^*$.
c: Update $\hat\theta_{12} = -\Theta_{11}\big(s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)/w_{22}$.
d: Update $\hat\theta_{22} = \big(1 - (s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*)^\top\hat\theta_{12}\big)/w_{22}$.
e: Update $\hat w_{12} = s_{12} + \gamma^* + \lambda(1-\alpha)\hat\theta_{12}$.
f: Update $\hat w_{22} = s_{22} + \lambda\alpha + \lambda(1-\alpha)\hat\theta_{22}$.
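A minimal R sketch of one sweep of the box-constrained coordinate update above; the coordinatewise minimizer of a convex quadratic under an interval constraint is the unconstrained minimizer clipped to the interval (illustration only, not the package's Fortran code):

```r
# One sweep of coordinate descent for the box-constrained QP (13).
# Theta11: (p-1)x(p-1) block, s12: length p-1, qt: abs(theta12), gamma: current value.
qp.sweep <- function(gamma, Theta11, s12, qt, lambda, alpha) {
  b <- 1 + (1 - alpha) / alpha * qt            # elementwise scaling from (13)
  for (j in seq_along(gamma)) {
    # sum_k Theta11[j,k] (s12_k + b_k gamma_k), excluding the j-th own term
    resid <- sum(Theta11[j, ] * (s12 + b * gamma)) -
             Theta11[j, j] * b[j] * gamma[j]
    g <- -resid / (Theta11[j, j] * b[j])       # unconstrained coordinate minimizer
    gamma[j] <- min(max(g, -lambda * alpha), lambda * alpha)   # clip to the box
  }
  gamma
}
```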
Lemma 1. Suppose $\Theta \succ 0$ is used as warm start for the dpgelnet algorithm. Then every row/column update of dpgelnet maintains the positive definiteness of the working precision matrix $\Theta$.

Note that a corresponding lemma for dpglasso is proven by Mazumder and Hastie (2012c), and hence we just provide a simple extension here.

Proof 1.
Let
$$A = \begin{pmatrix} A_{11} & a_{12} \\ a_{12}^\top & a_{22} \end{pmatrix}.$$
The condition $A \succ 0$ is equivalent to:
$$A_{11} \succ 0 \quad \text{and} \quad a_{22} - a_{12}^\top (A_{11})^{-1} a_{12} > 0. \qquad (14)$$

Consider updating the $p$-th row/column of the precision matrix. Since the block $\Theta_{11}$ remains fixed, we only need to show the second condition from (14). Using the updates of dpgelnet (Algorithm 4):

$$\hat\theta_{22} - \hat\theta_{12}^\top(\Theta_{11})^{-1}\hat\theta_{12} = \frac{1 - \big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)^\top\hat\theta_{12}}{w_{22}} + \frac{\big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)^\top(\Theta_{11})^{-1}\Theta_{11}\,\hat\theta_{12}}{w_{22}} = \frac{1}{w_{22}} = \frac{1}{s_{22} + \lambda\alpha + \lambda(1-\alpha)\theta_{22}} > 0.$$

The problem of solving $p$-dimensional glasso (or dpglasso) problems can in some cases be reduced to solving several lower-dimensional problems, enabling massive speed-ups for the computations (see Mazumder and Hastie, 2012b; Witten et al., 2011). A similar result also holds for gelnet (or dpgelnet). Following the notation of Mazumder and Hastie (2012b), define the nodes $\mathcal{V} = \{1, \dots, p\}$ and the matrix $E$:

$$E_{ij} = \begin{cases} 1 & \hat\Theta_{ij}(\lambda, \alpha) \neq 0,\ i \neq j, \\ 0 & \text{otherwise}, \end{cases}$$

where $\hat\Theta(\lambda, \alpha)$ is the estimated precision matrix for the input covariance matrix $S$ and the two tuning parameters $\lambda$ and $\alpha$, see equation (1). This defines a symmetric graph $\mathcal{G} = (\mathcal{V}, E)$. Decompose this graph into its connected components, i.e., $\mathcal{G} = \bigcup_{l=1}^{\tilde L} \mathcal{G}_l$, where $\tilde L$ is the number of connected components and $\mathcal{G}_l = (\mathcal{V}_l, E_l)$ the $l$-th sub-graph. Furthermore, define $\hat E$ as

$$\hat E_{ij} = \begin{cases} 1 & |S_{ij}| > \lambda\alpha,\ i \neq j, \\ 0 & \text{otherwise}, \end{cases}$$

and the graph $\hat{\mathcal{G}} = (\mathcal{V}, \hat E)$. Decompose this graph into its connected components as well, i.e., $\hat{\mathcal{G}} = \bigcup_{l=1}^{\hat L} \hat{\mathcal{G}}_l$, where $\hat L$ is the number of connected components and $\hat{\mathcal{G}}_l = (\hat{\mathcal{V}}_l, \hat E_l)$.

Theorem 1 (taken from Atchadé et al., 2015). Let $\mathcal{G}_l = (\mathcal{V}_l, E_l)$, $l = 1, \dots, \tilde L$ and $\hat{\mathcal{G}}_l = (\hat{\mathcal{V}}_l, \hat E_l)$, $l = 1, \dots, \hat L$ denote the connected components, as defined above. Then $\tilde L = \hat L$ and there exists a permutation $\Pi$ on $\{1, \dots, \hat L\}$ such that $\hat{\mathcal{V}}_{\Pi(l)} = \mathcal{V}_l$ and $\hat E_{\Pi(l)} = E_l$ for all $l = 1, \dots, \hat L$.

The determination of the connected components (based on $\hat E$) is computationally cheap. Note that the parts of $\hat\Theta$ corresponding to the different connected components can then be solved independently, i.e. a suitable permutation of $\Theta$ and $W$ leads to block-diagonal form. The $\hat L$ blocks are exactly of the sizes of the connected components. This result is especially attractive if the maximum size of the connected components is small compared to $p$, since the additional effort to compute the connected components is negligible compared to the gains of reducing the problem into smaller sized problems. For fixed $\alpha \neq 0$ and $S$, the number of connected components is increasing in $\lambda$. If $\lambda\alpha \geq |S_{ij}|$ for all $i, j \in \mathcal{V}$ with $i \neq j$, then $\hat\Theta$ and $\hat W$ are diagonal matrices. We incorporated the check of such connected components prior to starting the actual calculations in our implementation in the GLassoElnetFast R-package. A minimal sketch of this screening step is given below.
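A minimal R sketch of the screening step, assuming the igraph package is used for the connected component computation (the package's internal implementation differs):

```r
library(igraph)

# Screening: threshold |S_ij| > lambda * alpha off the diagonal, then find
# connected components; each block can be solved as a separate, smaller problem.
connected.blocks <- function(S, lambda, alpha) {
  A <- (abs(S) > lambda * alpha) * 1
  diag(A) <- 0
  g <- graph_from_adjacency_matrix(A, mode = "undirected")
  split(seq_len(ncol(S)), components(g)$membership)   # list of variable blocks
}
```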
3.3 Inclusion of diagonal target matrices

We now turn towards the inclusion of a positive semi-definite (diagonal) target matrix $T$ into the Graphical Elastic Net problem. Some motivation for this is provided by van Wieringen and Peeters (2016) as well as Kuismin et al. (2017) in the case of Ridge type penalization. The target can be interpreted as prior knowledge or an educated guess. We aim to solve problem (1), which we recall here:

$$\hat\Theta(\lambda, \alpha, T) = \operatorname*{argmin}_{\Theta \succ 0} \Big\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\big(\alpha\|\Theta - T\|_1 + \tfrac{1-\alpha}{2}\|\Theta - T\|_2^2\big) \Big\}.$$

This optimization problem is difficult to solve efficiently in general. In the $L_1$-penalty case, van Wieringen (2019) proposed to iteratively apply his generalized Ridge estimator (with elementwise differing $\lambda$) to approximate the loss function for the $L_1$ case. However, this approach is computationally not attractive, even for moderately sized problems (see section 5 for computational times). We propose to simplify the problem by considering only positive semi-definite diagonal target matrices. While diagonal matrices are less flexible than arbitrary positive semi-definite target matrices, we note that in practice many fall into this category (e.g. all target matrices that were used by van Wieringen and Peeters (2016) and by Kuismin et al. (2017)).

When considering diagonal target matrices with non-negative entries, the normal equations for the diagonal entries change, while for the other entries they remain the same. For each diagonal entry three cases can occur, leading to the following normal equations:

Case 1: $\theta_{22} > t_{22}$, then $w_{22} = s_{22} + \lambda\alpha + \lambda(1-\alpha)(\theta_{22} - t_{22})$.
Case 2: $\theta_{22} = t_{22}$, then $w_{22} = s_{22} + \lambda\alpha u$, where $u \in [-1, 1]$.
Case 3: $\theta_{22} < t_{22}$, then $w_{22} = s_{22} - \lambda\alpha + \lambda(1-\alpha)(\theta_{22} - t_{22})$.

Note that in the traditional setting, when zero is the target matrix, case 1 always occurs. In the general case, however, one cannot automatically update diagonal elements based on case 1. We derive in Appendix A.1 the technical modifications of the gelnet algorithm. Overall, in contrast to the algorithm without target, the initialization and updating of the diagonal entries require care. We note that our chosen updates work for realistic targets, but occasionally may not converge for targets with very large diagonal entries. However, such targets are not desirable from a statistical perspective. A small sketch of the case distinction is given below.
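For illustration, a minimal R sketch directly transcribing the three-case diagonal normal equation above (the actual update rule used inside the algorithm is derived in Appendix A.1):

```r
# Diagonal normal equation with a diagonal target entry t22: returns w22 for a
# given theta22 (cases 1-3 above). For theta22 == t22 the subgradient u in
# [-1, 1] is a free parameter, so we return the interval bounds.
w22.update <- function(theta22, t22, s22, lambda, alpha) {
  if (theta22 > t22) {
    s22 + lambda * alpha + lambda * (1 - alpha) * (theta22 - t22)
  } else if (theta22 < t22) {
    s22 - lambda * alpha + lambda * (1 - alpha) * (theta22 - t22)
  } else {
    c(s22 - lambda * alpha, s22 + lambda * alpha)   # any value in between is valid
  }
}
```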
Our proposed algorithms gelnet and dpgelnet are generalizations of glasso and dpglasso. For the Lasso case $\alpha = 1$, gelnet performs the same updates as glasso and dpgelnet performs the same updates as dpglasso. For Elastic Net penalties, i.e. $\alpha \in (0, 1)$, the updates of gelnet and dpgelnet are slightly modified to incorporate the additional quadratic penalty term. Lemma 1 ensures that dpgelnet maintains positive definite precision matrices throughout the algorithm for arbitrary positive definite precision matrices as warm starts. This feature is appealing when a fit is already available, for example from a slightly different $\lambda$ value in cross-validation, or in change point detection problems from a neighbouring split point (see Kovács et al., 2020a,b). gelnet lacks this feature, and for certain warm starts it may not converge. However, as an advantage, gelnet has updates that allow the incorporation of diagonal target matrices, and hence it is more flexible than dpgelnet in its currently presented form. As long as single fits are required (i.e. without warm starts), we recommend gelnet. If repeated fits relying on warm starts are necessary in some application, one can still try to use gelnet with warm starts and, in case this faces convergence issues, resort to dpgelnet.

4 Simulations

In this section the statistical performance of the Graphical Elastic Net estimator is compared to its special cases glasso (Friedman et al., 2008) and rope (Kuismin et al., 2017). Note that van Wieringen and Peeters (2016) also proposed Ridge-type penalization and called it the Alternative Ridge Precision estimator (with Type I for the case of a zero target matrix and Type II for a nonzero positive definite target matrix). For simplicity, we will use the name rope in the following for such Ridge-type penalization approaches.
Simulation models.
The simulation setup is similar to that of Kuismin et al. (2017). We draw $n$ independent realizations from multivariate Gaussian distributions $N(0, \Sigma)$, where $\Sigma = [\sigma_{i,j}] \in \mathbb{R}^{p \times p}$ and $\Theta = [\theta_{i,j}] = \Sigma^{-1}$ are positive definite matrices coming from 6 different models.

• Model 1 (Compound symmetry model): $\sigma_{i,i} = 1$ and $\sigma_{i,j} = 0.\!\ldots$ for $i \neq j$.

Models 2-4 are taken from Liu and Wang (2017). In these models an adjacency matrix $A$ is generated from a graph, where each nonzero off-diagonal element is set to 0.3 and the diagonal elements to 0. Then the smallest eigenvalue $\Lambda_{\min}(A)$ is calculated and the corresponding precision matrix is constructed by $\Theta = D(A + (|\Lambda_{\min}(A)| + 0.\!\ldots)I)D$, where $D \in \mathbb{R}^{p \times p}$ is a diagonal matrix with $D_{i,i} = 1$ for $i = 1, \dots, p/2$ and $D_{i,i} = 3$ for $i = p/2 + 1, \dots, p$.

• Model 2 (Scale-free graph model): The graph begins with an initial small chain graph of 2 nodes. New nodes are added to the graph one at a time. Each new node is connected to one existing node with a probability proportional to the number of degrees that the existing node already has.
• Model 3 (Hub graph model): The $p$ nodes are evenly partitioned into $p/10$ disjoint groups, with each group containing 10 nodes. Within each group, one node is selected as the hub and edges between the hub and the other 9 nodes are added.
• Model 4 (Block graph model): Here $\tilde\Theta$ is directly produced by making a block diagonal matrix with block size $p/\ldots$, where the off-diagonal entries are set to 0.5 and the diagonal entries to 1. The matrix is then randomly permuted by rows/columns and the resulting covariance matrix is taken as $\Sigma = D^{-1}\tilde\Theta^{-1}D^{-1}$, where this time $D_{i,i} = 1$ for $i = 1, \dots, p/2$ and $D_{i,i} = 1.\!\ldots$ for $i = p/2 + 1, \dots, p$.

Models 5 & 6 are graphical models coming from Cai et al. (2016). First generate the matrix $\tilde\Theta = [\tilde\theta_{i,j}]$ and then multiply with the inverse of a diagonal matrix $D$ from both sides to get $\Sigma$. Each diagonal entry of $D$ is independently generated from a uniform distribution on the interval 1 to 5.

• Model 5 (Band graph model): $\tilde\theta_{i,i} = 1$, $\tilde\theta_{i,i+1} = \tilde\theta_{i+1,i} = 0.6$, $\tilde\theta_{i,i+2} = \tilde\theta_{i+2,i} = 0.3$, $\tilde\theta_{i,j} = 0$ for $|i - j| \geq 3$.
• Model 6 (Erdős-Rényi random graph model): Take $\tilde{\tilde\Theta} = [\tilde{\tilde\theta}_{i,j}]$, where $\tilde{\tilde\theta}_{i,j} = u_{i,j} \cdot \delta_{i,j}$, such that $\delta_{i,j}$ is a Bernoulli random variable with success probability 0.05 and $u_{i,j}$ is a uniform random variable on the interval 0.4 to 0.8. Then take $\tilde\Theta = \tilde{\tilde\Theta} + (|\Lambda_{\min}(\tilde{\tilde\Theta})| + 0.\!\ldots)I$.

For all models we transform the covariance matrix to a correlation matrix before simulating the data. With the exception of Model 1, all models have a sparse structure in $\Theta$. In order to get performance measures for the different methods, 100 independent simulations are performed for each model, and the averages of several loss functions are calculated. Using five-fold cross-validation, the optimal tuning parameter $\lambda$ for each method is determined for each simulation run separately. As an illustration, a sketch of one of the model constructions is given below.
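A minimal R sketch of the Model 5 (band graph) construction, including the diagonal scaling matrix $D$ and the final rescaling to a correlation matrix described above:

```r
# Band precision matrix of Model 5: 1 on the diagonal, 0.6 on the first
# off-diagonal, 0.3 on the second, 0 elsewhere; then Sigma = D^-1 Theta~^-1 D^-1
# with D entries uniform on [1, 5], rescaled to a correlation matrix.
model5.sigma <- function(p) {
  ThetaT <- diag(p)
  ThetaT[abs(row(ThetaT) - col(ThetaT)) == 1] <- 0.6
  ThetaT[abs(row(ThetaT) - col(ThetaT)) == 2] <- 0.3
  D <- diag(runif(p, 1, 5))
  Sigma <- solve(D) %*% solve(ThetaT) %*% solve(D)
  cov2cor(Sigma)   # transform to a correlation matrix
}
```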
Performance measures. The different measures used in the simulations can be split into two groups.

1) Loss functions used by Kuismin et al. (2017):
• Kullback-Leibler loss: $\mathrm{KL} = \operatorname{tr}(\Sigma\hat\Theta) - \log(\det(\Sigma\hat\Theta)) - p$
• L2 loss: $\mathrm{L2} = \|\Theta - \hat\Theta\|_F$
• Spectral norm loss: $\mathrm{SP} = d_1$, where $d_1$ is the largest eigenvalue of the matrix $(\Theta - \hat\Theta)^\top(\Theta - \hat\Theta)$

2) Graph recovery measures: Let $\hat\Theta$ be the estimated solution and $\Theta$ the underlying truth. For a given threshold $\epsilon$, define the adjacency matrices $\hat A$ and $A$:

$$\hat A_{i,j} = \begin{cases} 1 & |\hat\Theta_{i,j}| \geq \epsilon, \\ 0 & \text{otherwise}, \end{cases} \qquad A_{i,j} = \begin{cases} 1 & |\Theta_{i,j}| \geq \epsilon, \\ 0 & \text{otherwise}. \end{cases}$$

For $i < j$, the edge $ij$ in $\hat A$ is present if $\hat A_{i,j} = 1$, and similarly for $A$. Now define the following quantities:
• True Positives: TP = number of edges $ij$ which are both in $\hat A$ and in $A$
• True Negatives: TN = number of edges $ij$ which are neither in $\hat A$ nor in $A$
• False Positives: FP = number of edges $ij$ in $\hat A$ but not in $A$
• False Negatives: FN = number of edges $ij$ in $A$ but not in $\hat A$

Then define the graph recovery measures:
• F1 score: $\mathrm{F1} = \dfrac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$
• Matthews correlation coefficient: $\mathrm{MCC} = \dfrac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$

The F1 score takes values in $[0, 1]$ and the MCC in $[-1, 1]$, with higher values being better. The threshold $\epsilon$ in the simulations is fixed as $\epsilon = 10^{-\ldots}$. When interpreting the results shown in Figure 7 and Figure 8 later on, one should keep in mind that the rope algorithm produces non-sparse solutions, and therefore almost no true negatives and false negatives are produced for this $\epsilon$. Moreover, Model 1 is not sparse, and thus analyzing the graph recovery measures in this model is not meaningful. A minimal sketch of the graph recovery measures is given below.
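A minimal R sketch of the graph recovery measures, given the true and estimated precision matrices (the default eps = 1e-4 here is an assumption, as the exact threshold exponent did not survive extraction):

```r
# F1 and MCC for edge recovery: edges are off-diagonal entries (i < j) whose
# absolute value exceeds the threshold eps.
recovery <- function(Theta, Theta.hat, eps = 1e-4) {
  up <- upper.tri(Theta)
  a  <- abs(Theta)[up] >= eps       # true edges
  ah <- abs(Theta.hat)[up] >= eps   # estimated edges
  tp <- sum(ah & a);  tn <- sum(!ah & !a)
  fp <- sum(ah & !a); fn <- sum(!ah & a)
  f1  <- 2 * tp / (2 * tp + fp + fn)
  mcc <- (tp * tn - fp * fn) /
         sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(F1 = f1, MCC = mcc)
}
```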
Figure 4: Summary of KL loss for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean losses. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 5: Summary of L2 loss for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean losses. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 6: Summary of SP loss for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean losses. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 7: Summary of F1 score for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean scores. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 8: Summary of MCC score for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean scores. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).
Details on the methods used.
Note that due to the $L_2$-penalty term, the scaling of the variables matters. Thus, as input $S$, we recommend to use the sample correlation rather than the covariance matrix in practice. For the Graphical Elastic Net we fixed $\alpha = 0.5$ in the Elastic Net penalty of equation (1) whenever using it. The simulations were done with and without targets, as well as with and without penalizing the diagonal. Notice that in our case the targets are diagonal matrices, so not penalizing the diagonal automatically results in no target. The following abbreviations are used in the figures; the target matrices mentioned here are described in section 2.1.
• F: Setting the penalize.diagonal argument to FALSE, i.e. no penalization of the diagonal.
• True: Using the True Diagonal target matrix.
• I: Using the Identity target matrix.
• vI: Using the $v$-Identity target matrix.
• EV: Using the Eigenvalue target matrix.
• MSC: Using the Maximal Single Correlation target matrix.
• Reg: Using the Nodewise Regression target matrix.
Simulation results.
The results are displayed in Figures 4, 5, 6, 7 and 8. As the dpgelnet algorithm leads to results indistinguishable from those of the gelnet algorithm, we left it out of the figures and also of the discussion below. Various performance measures may yield different rankings of the methods, such that it is hard to draw an overall conclusion on which method is superior, as well as a general recommendation on which method or target matrices to use. Nonetheless, we highlight a few observations based on the results shown in the figures:
• Not penalizing the diagonal of $\Theta$ leads to smaller losses than penalizing with zero as a target.
• While the message from Kuismin et al. (2017), that rope with target performs better than glasso with no target, holds true in some of the models, there are some models opposing this statement in general.
• In general, it is an advantage to include a target in the algorithms, as each method performs better if a suitable target is chosen.
• The question of how to determine the optimal target matrix is still open. Our simulations show that the True Diagonal target matrix performs best in terms of L2 loss and SP loss, and thus approaching the diagonal of the underlying precision matrix is desirable. However, the True Diagonal target is not necessarily the winner in terms of the Kullback-Leibler loss.
• There is a systematic bias in the estimated diagonal entries of the covariance matrices if the diagonal entries are penalized. For example, for the standard glasso algorithm with zero as target matrix, the diagonal of $\hat W$ is set as $\hat w_{i,i} = s_{i,i} + \lambda$ (if the diagonal is penalized). Similarly, bias of the diagonal entries also occurs for Elastic Net penalties and target matrices. In all these cases, one could try re-scaling the matrix such that $\hat w_{i,i} = s_{i,i}$. All statements above would still hold, but are often less clearly visible.

Including a reasonable target matrix seems to improve the estimation and is thus recommended. The intuition that the true diagonal (or something close to it) is the best diagonal target does not necessarily hold. Furthermore, there is no overall winner. Depending on the underlying precision matrix and the considered performance measure, different target types might be better suited. Hence, we recommend exploring different target types as a tool to gain better insight into the data. The benefits of Elastic Net penalties are not clearly visible in the considered simulations. Bigger differences are expected for highly correlated variables, analogous to the findings in regression by Zou and Hastie (2005), where the Elastic Net could potentially lead to more stable estimates.
5 Computational timings

Our implementations in the GLassoElnetFast R-package build upon the glasso (Friedman et al., 2019) and dpglasso (Mazumder and Hastie, 2012a) R-packages by suitably modifying them. Additionally, we also utilized a faster implementation from the glassoFast R-package (Sustik et al., 2018; Sustik and Calderhead, 2012) that avoids unnecessary copying of subsets of the working covariance matrix when setting up row and column updates, which causes some inefficiency in the glasso package. Moreover, whenever possible, we even improve on the original versions. For example, to achieve further speed-ups, we incorporate block diagonal screening (Mazumder and Hastie, 2012b; Witten et al., 2011), which was missing in the dpglasso and glassoFast packages. Compared to the dpglasso package, our dpgelnet implementation relies even more on Fortran, and it avoids unnecessary copying of working covariance matrices (in the style of the glassoFast package), both of which are beneficial for its speed.
Figure 9 compares the computational speed for Graphical Lasso estimation problems (i.e., $\alpha = 1$) using the implementations from the GLassoElnetFast, glasso and glassoFast R-packages. Our gelnet implementation in the GLassoElnetFast package is implemented similarly to the glassoFast package and is hence similarly fast. However, we additionally included the block-diagonal screening rule of Mazumder and Hastie (2012b) and Witten et al. (2011). As shown on the right plot of Figure 9, in case the problem can be decomposed into connected components (see also Theorem 1), our implementation can be considerably faster. There might be problems, though, where everything (or at least a predominant majority of the nodes) belongs to the same connected component, see the left of Figure 9. In such cases our implementation tends to be slightly slower compared to the glassoFast package, because the check for connected components has to be done, and additionally slightly more computations are carried out, as our implementation can handle general Elastic Net penalties rather than only fine-tuned computations for the Graphical Lasso case of $\alpha = 1$.

Figure 9: Computed accuracy (maximal absolute error of the off-diagonal entries) for the working covariance matrix $W$ (on a logarithmic scale) vs. average computational times of 10 runs each. On the left, the input correlation matrix is the one from the ER dataset ($p = 692$, available in the QUIC R-package of Hsieh et al., 2014). On the right, analogous results are shown for a $p = 3460$-dimensional problem composed of five blocks of size 692 each, i.e., the input here is a block diagonal matrix where the correlation matrix of the ER dataset was used for each of the five blocks. In all cases a standard Graphical Lasso estimator (i.e., $\alpha = 1$) with tuning parameter $\lambda = 0.\!\ldots$ was computed with the GLassoElnetFast, glasso and glassoFast R-packages.

We introduced new methodology for sparse precision matrix estimation, and in the following we would like to demonstrate efficient implementations for these new proposals as well. To measure the computational times of the different methods, the
FHT dataset, available e.g. in the R-package gcdnet (Yang and Zou, 2017), is used. It contains $n = 50$ observations and $p = 100$ variables, for which we calculate the empirical correlation matrix $S \in \mathbb{R}^{p \times p}$. For a fixed penalty parameter $\lambda$, each method computes the solution 20 times without warm start and 20 times with a warm start coming from the solution with a similar parameter $\lambda$. The average computational times are displayed in Figure 10. The abbreviations for the algorithms are similar to those of section 4, with the small additions that the number in brackets denotes the value of $\alpha$, and IRfGlassoT is the Iterative Ridge for Glasso with a target matrix, as proposed and implemented by van Wieringen (2019). Table 1 summarizes the abbreviations of the methods and the availability of the implementations. The target matrix $T$ used in these simulations is always the Identity matrix.

The plots on the left of Figure 10 are based on values $\lambda$ such that only one connected component is present. Note that the required $\lambda$ differs for various $\alpha$ values in the Elastic Net penalty (see Theorem 1). gelnet and dpgelnet are competitive with glasso and dpglasso both in the cold start (top) and in the warm start case (bottom) for $\alpha = 0.5$ and $\alpha = 1$. For small $\alpha$ they slow down, and the closed form rope for $\alpha = 0$ is much faster. Note here, however, that the closed form for rope does not allow entry-wise differing $\lambda$ values, which would be possible using our iterative gelnet and dpgelnet algorithms with $\alpha = 0$. To further illustrate the efficiency of the implementations, the gains of using connected components are shown in the right panel of Figure 10. The parameter $\lambda$ is chosen such that the largest connected component is of size 50, whenever this is possible, i.e. for all algorithms except rope, gelnet ($\alpha = 0$) and dpgelnet ($\alpha = 0$). The timings show that instead of using dpglasso, where only the inner loop of coordinate descent is implemented in Fortran, it is recommended to use dpgelnet with $\alpha = 1$, where both the outer and inner loop are implemented in Fortran, the more efficient general implementation in the style of the glassoFast package is included, and the check for connected components is performed prior to the computations in order to gain further computational efficiency. Besides the speedup resulting from solving the problem within several smaller blocks due to the results on connected components, this setup is also much sparser (due to a larger value of the tuning parameter $\lambda$) for Elastic Net based estimators, which leads to considerable speedups compared to the left panel. In particular, Elastic Net based sparse estimators (with $\alpha = 0.5$ and $\alpha = 1$) are even faster in this case than the closed form solution for rope.

The only competitor that is able to incorporate a target matrix into the Graphical Lasso is the Iterative Ridge for Glasso of van Wieringen (2019). The available implementation of the latter approach is limited to $\alpha = 1$, and what is even more critical is that its speed is orders of magnitude behind all of our presented algorithms. In particular, the Iterative Ridge for Glasso approach is much slower than our gelnet algorithm with target matrices, with the latter also enabling targets in the full range of $\alpha \in [0, 1]$. Moreover, the glassoFast implementation of the glasso algorithm is considerably faster than the one from the glasso package. Our gelnet implementation (for $\alpha = 1$) is approximately as fast as the glassoFast one. A minimal sketch of such a timing comparison is given below.
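A minimal R sketch of such a timing comparison, assuming the microbenchmark package and the alpha argument name used in the earlier snippets (X denotes the 50 x 100 FHT data matrix):

```r
library(microbenchmark)
library(glasso)
library(glassoFast)

S <- cor(X)   # empirical correlation matrix of the FHT data
lam <- 0.3    # a fixed penalty parameter (illustrative value)

microbenchmark(
  glasso     = glasso::glasso(S, rho = lam),
  glassoFast = glassoFast::glassoFast(S, rho = lam),
  gelnet     = gelnet(S, lambda = lam, alpha = 1),
  times = 20
)
```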
Table 1: Abbreviations of the implementations used for Figure 10.

abbreviation | problem, algorithm | implementation (function, package)
Rope | α = 0, closed form | rope, GLassoElnetFast
Glasso | α = 1, glasso | glasso, glasso
GlassoFast | α = 1, glasso | glassoFast, glassoFast
Gelnet(1) | α = 1, gelnet | gelnet, GLassoElnetFast
Gelnet(0.5) | α = 0.5, gelnet | gelnet, GLassoElnetFast
Gelnet(0) | α = 0, gelnet | gelnet, GLassoElnetFast
DPGlasso | α = 1, dpglasso | dpglasso, dpglasso
DPGelnet(1) | α = 1, dpgelnet | dpgelnet, GLassoElnetFast
DPGelnet(0.5) | α = 0.5, dpgelnet | dpgelnet, GLassoElnetFast
DPGelnet(0) | α = 0, dpgelnet | dpgelnet, GLassoElnetFast
GelnetT(1) | α = 1 with target, gelnet | gelnet, GLassoElnetFast
GelnetT(0.5) | α = 0.5 with target, gelnet | gelnet, GLassoElnetFast
IRfGlassoT | α = 1 with target, iterative ridge | see van Wieringen (2019)
Figure 10:
Average computation times for the 100-dimensional correlation matrix from the FHT dataset with a fixed penalty parameter, with cold starts (on the top, in blue) or warm starts coming from a similar penalty parameter (on the bottom, in red). The plots on the right use λ such that the estimate Θ̂ splits into connected components of maximum size 50, whereas for the plots on the left λ is chosen such that Θ̂ stays fully connected. For the explanations of the names, see Table 1.

By combining the best parts of various available implementations, our software often improves previous DP-Graphical Lasso implementations considerably, and it is as fast as, or in some scenarios even faster than, existing efficient Graphical Lasso implementations. Moreover, our software generalizes beyond the case α = 1 to incorporate Elastic Net penalties, and the Graphical Elastic Net variant also allows the inclusion of a diagonal target matrix. Computation with Elastic Net penalties typically takes slightly longer than with the L1-penalty only. Our implementation for target matrices is also efficient, and it is orders of magnitude faster than the competing but ad-hoc Iterative Ridge algorithm of van Wieringen (2019).

Conclusions

We consider Elastic Net type penalization for precision matrix estimation based on the Gaussian log-likelihood. To solve this task, we propose two novel algorithms, gelnet and dpgelnet. They build substantially on earlier algorithmic work for the glasso for precision matrix estimation (Friedman et al., 2008; Mazumder and Hastie, 2012c), the Elastic Net for regression (Zou and Hastie, 2005) and the inclusion of target matrices to encode prior information towards which the estimator is shrunken (van Wieringen and Peeters, 2016; Kuismin et al., 2017). Advantages of our new work include methodological extensions enabling higher flexibility for data analysis, the important option of including diagonal target matrices into the estimation methods, user-friendly software, an efficient implementation and the possibility to choose different methodological versions within the same software package.

dpgelnet optimizes over the desired precision matrix in each block update, whereas gelnet works with the covariance matrix, the inverse of the precision matrix. There is no overall advantage of using one algorithm over the other, and computational performance gains depend on the true underlying signal. All our algorithms are implemented efficiently in the R-package GLassoElnetFast using a combination of R and Fortran code. They are competitive in terms of computational time with existing methods. There seems to be no overall winner in terms of statistical performance. The inclusion of an L2-norm into the Elastic Net penalty encourages additional stability, especially in the presence of highly correlated variables. Moreover, our simulations show that target matrices and not penalizing the diagonal can be powerful tools to improve the estimation of precision matrices. Our new software in the R-package GLassoElnetFast and its algorithmic methodology provide a unifying tool to use various modifications of the popular glasso algorithm.
Acknowledgement
Solt Kovács and Peter Bühlmann have received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 786461 CausalStats - ERC-2017-ADG).
References
Allen, G. I. (2010). Transposable regularized covariance models with applications to high-dimensional data. PhD thesis, Stanford University, Department of Statistics.

Atchadé, Y. F., Mazumder, R., and Chen, J. (2015). Scalable computation of regularized precision matrices via stochastic optimization. arXiv:1509.00426.

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516.

Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227.

Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.

Cai, T. T., Liu, W., and Zhou, H. H. (2016). Estimating sparse precision matrix: optimal rates of convergence and adaptive estimation. The Annals of Statistics, 44(2):455–488.

Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.

Danaher, P., Wang, P., and Witten, D. M. (2012). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B, 76(2):373–397.

Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive lasso and scad penalties. The Annals of Applied Statistics, 3(2):521–541.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.

Friedman, J., Hastie, T., and Tibshirani, R. (2019). Graphical Lasso: Estimation of Gaussian Graphical Models. glasso R package version 1.11.

Guo, J., Levina, E., Michailidis, G., and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika, 98(1):1–15.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15:2911–2947.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., Ravikumar, P. K., and Poldrack, R. (2013). BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems 26, pages 3165–3173.

Jankova, J. and van de Geer, S. (2015). Confidence intervals for high-dimensional inverse covariance estimation. Electronic Journal of Statistics, 9:1205–1229.

Kovács, S., Li, H., Bühlmann, P., and Munk, A. (2020). Seeded binary segmentation: a general methodology for fast and optimal change point detection. arXiv:2002.06633.

Kovács, S., Li, H., Haubner, L., Munk, A., and Bühlmann, P. (2020). Optimistic search strategy: change point detection for large-scale data via adaptive logarithmic queries. arXiv:2010.10194.

Kuismin, M. O., Kemppainen, J. T., and Sillanpää, M. J. (2017). Precision matrix estimation with ROPE. Journal of Computational and Graphical Statistics, 26(3):682–694.

Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics, 37(6):4254–4278.

Lauritzen, S. L. (1996). Graphical Models. Oxford Statistical Science Series. Oxford University Press.

Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411.

Liu, H. and Wang, L. (2017). TIGER: A tuning-insensitive approach for optimally estimating Gaussian graphical models. Electronic Journal of Statistics, 11(1):241–294.

Londschien, M., Kovács, S., and Bühlmann, P. (2020). Change point detection for graphical models in the presence of missing values. Journal of Computational and Graphical Statistics, to appear.

Marlin, B. M. and Murphy, K. P. (2009). Sparse Gaussian graphical models with unknown block structure. In Proceedings of the 26th International Conference on Machine Learning, pages 705–713.

Mazumder, R. and Hastie, T. (2012a). dpglasso: Primal Graphical Lasso. dpglasso R package version 1.0.

Mazumder, R. and Hastie, T. (2012b). Exact covariance thresholding into connected components for large-scale graphical Lasso. Journal of Machine Learning Research, 13(1):781–794.

Mazumder, R. and Hastie, T. (2012c). The graphical Lasso: new insights and alternatives. Electronic Journal of Statistics, 6:2125–2149.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.

Peeters, C., Bilgrau, A., and van Wieringen, W. (2020). rags2ridges: Ridge Estimation of Precision Matrices from High-Dimensional Data. rags2ridges R package version 2.2.3.

Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746.

Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing L1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.

Rothman, A. J. (2012). Positive definite estimators of large covariance matrices. Biometrika, 99(3):733–740.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.

Scheinberg, K., Ma, S., and Goldfarb, D. (2010). Sparse inverse covariance selection via alternating linearization methods. In Advances in Neural Information Processing Systems 23, pages 2101–2109.

Shan, L. and Kim, I. (2018). Joint estimation of multiple Gaussian graphical models across unbalanced classes. Computational Statistics & Data Analysis, 121:89–103.

Städler, N. and Bühlmann, P. (2012). Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 22(1):219–235.

Sustik, M. A. and Calderhead, B. (2012). GLASSOFAST: An efficient GLASSO implementation. UTCS Technical Report TR-12-29:1–3.

Sustik, M. A., Calderhead, B., and Clavel, J. (2018). glassoFast: Fast Graphical LASSO. glassoFast R package version 1.0.

Tan, K. M., Witten, D., and Shojaie, A. (2015). The cluster graphical lasso for improved estimation of Gaussian graphical models. Computational Statistics & Data Analysis, 85:23–36.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494.

van Wieringen, W. N. (2019). The generalized ridge estimator of the inverse covariance matrix. Journal of Computational and Graphical Statistics, 28(4):932–942.

van Wieringen, W. N. and Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103:284–303.

Wille, A., Zimmermann, P., Vranová, E., et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11).

Witten, D. M., Friedman, J. H., and Simon, N. (2011). New insights and faster computations for the graphical Lasso. Journal of Computational and Graphical Statistics, 20(4):892–900.

Yang, Y. and Zou, H. (2017). gcdnet: LASSO and Elastic Net (Adaptive) Penalized Least Squares, Logistic Regression, HHSVM, Squared Hinge SVM and Expectile Regression using a Fast GCD Algorithm. gcdnet R package version 1.0.5.

Yuan, M. (2010). High-dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261–2286.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.

Zhou, S., Rütimann, P., Xu, M., and Bühlmann, P. (2011). High-dimensional covariance estimation based on Gaussian graphical models. Journal of Machine Learning Research, 12:2975–3026.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320.
Appendix
A.1 Details on target matrices
We discuss the inclusion of a positive semi-definite (diagonal) target matrix T into the Graphical Elastic Net problem. We aim to solve problem (1), which we recall here:

$$\hat{\Theta}(\lambda, \alpha, T) = \operatorname*{argmin}_{\Theta \succ 0} \left\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\Big(\alpha\,\|\Theta - T\|_1 + \frac{1-\alpha}{2}\,\|\Theta - T\|_2^2\Big) \right\}. \qquad (1)$$

This optimization problem is difficult to solve efficiently in general. We propose to simplify the problem by considering only positive semi-definite diagonal target matrices. When considering diagonal target matrices with non-negative entries, the normal equations for the diagonal entries change, while for the other entries the normal equations remain the same. For each diagonal entry (j, j) three cases can occur, leading to the following normal equations:

Case 1: θ_jj > t_jj, then w_jj = s_jj + λα + λ(1 − α)(θ_jj − t_jj).
Case 2: θ_jj = t_jj, then w_jj = s_jj + λα u, where u ∈ [−1, 1].
Case 3: θ_jj < t_jj, then w_jj = s_jj − λα + λ(1 − α)(θ_jj − t_jj).

Note that in the traditional case, when zero is the target, case 1 always occurs. In the general case, however, one cannot automatically update diagonal elements based on case 1. The difficulty is that one does not know the precision matrix Θ in advance and hence whether updates corresponding to case 1, 2 or 3 would be suitable for the diagonal entries. This decision has to be made "on the fly". In the following we derive the necessary technical modifications of the gelnet algorithm. Overall, in contrast to the algorithm without a target, the initialization and the updates for the diagonal entries differ.
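To make the origin of these three cases explicit, one can write down the stationarity condition of problem (1) at a diagonal entry. The following display is our own sketch of this step; it only uses that the gradient of −log det Θ is −Θ⁻¹ = −W and the subdifferential of the absolute value:

```latex
% Stationarity (KKT) condition of (1) at the diagonal entry (j,j):
%   0 \in -w_{jj} + s_{jj} + \lambda\alpha\,\partial\lvert\theta_{jj}-t_{jj}\rvert
%         + \lambda(1-\alpha)(\theta_{jj}-t_{jj})
\[
  w_{jj} = s_{jj} + \lambda\alpha\, u + \lambda(1-\alpha)\,(\theta_{jj}-t_{jj}),
  \qquad u \in \partial\lvert\theta_{jj}-t_{jj}\rvert,
\]
% so that u = 1 for \theta_{jj} > t_{jj} (case 1), u \in [-1,1] for
% \theta_{jj} = t_{jj} (case 2) and u = -1 for \theta_{jj} < t_{jj} (case 3).
```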
Special Case. In the special case where all off-diagonal elements of S are smaller than or equal to λα, the solutions Θ and W are, by the results on connected components (see Theorem 1), diagonal matrices, and thus θ_jj = 1/w_jj for each diagonal entry. If α = 1, then w_jj = s_jj + λ and thus θ_jj = 1/(s_jj + λ). We now consider α ≠ 1.

Case 1: θ_jj > t_jj ≥ 0. Then 1/θ_jj = w_jj = s_jj + λα + λ(1 − α)(θ_jj − t_jj) > s_jj + λα, and therefore t_jj < θ_jj < 1/(s_jj + λα). In other words, case 1 is fulfilled if t_jj ∈ [0, 1/(s_jj + λα)).

Case 2: θ_jj = t_jj ≥ 0. Then 1/t_jj = 1/θ_jj = w_jj = s_jj + λα u, where u ∈ [−1, 1]. Suppose first that λα < s_jj. Then t_jj = 1/(s_jj + λα u), which is equivalent to t_jj ∈ [1/(s_jj + λα), 1/(s_jj − λα)]. On the other hand, if λα ≥ s_jj, then t_jj ∈ [1/(s_jj + λα), ∞) follows, since the target cannot be negative.

Case 3: θ_jj < t_jj. Then 1/θ_jj = w_jj = s_jj − λα + λ(1 − α)(θ_jj − t_jj) < s_jj − λα. Therefore, if λα ≥ s_jj, case 3 can never be fulfilled. Otherwise, θ_jj > 1/(s_jj − λα) and hence t_jj > θ_jj > 1/(s_jj − λα). In other words, case 3 is fulfilled if λα < s_jj and t_jj ∈ (1/(s_jj − λα), ∞).

Hence, for this special case, one can get the exact value of θ_jj by first determining from t_jj which case occurs. If θ_jj ≠ t_jj, a quadratic equation has to be solved. The values for w_jj follow from w_jj = 1/θ_jj.

Case 1:
$$\theta_{jj} = \frac{-\big(s_{jj} + \lambda\alpha - \lambda(1-\alpha)t_{jj}\big) + \sqrt{\big(s_{jj} + \lambda\alpha - \lambda(1-\alpha)t_{jj}\big)^2 + 4\lambda(1-\alpha)}}{2\lambda(1-\alpha)} \qquad (15)$$
Case 2: θ_jj = t_jj.
Case 3:
$$\theta_{jj} = \frac{-\big(s_{jj} - \lambda\alpha - \lambda(1-\alpha)t_{jj}\big) + \sqrt{\big(s_{jj} - \lambda\alpha - \lambda(1-\alpha)t_{jj}\big)^2 + 4\lambda(1-\alpha)}}{2\lambda(1-\alpha)}$$
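The case test and the closed-form solutions above fit in a few lines of R. This is a minimal sketch for a single diagonal entry, assuming λ > 0 and α < 1 (for α = 1 the solution is simply θ_jj = 1/(s_jj + λ)); all names are ours, chosen for illustration:

```r
# Closed-form diagonal solution in the special case where all off-diagonal
# entries of S are at most lambda * alpha; implements the case test and (15).
theta_diag_special <- function(s, t, lambda, alpha) {
  stopifnot(lambda > 0, alpha < 1, t >= 0)
  la   <- lambda * alpha
  lam2 <- lambda * (1 - alpha)
  # positive root of lam2 * theta^2 + b * theta - 1 = 0
  solve_quad <- function(b) (-b + sqrt(b^2 + 4 * lam2)) / (2 * lam2)
  if (t < 1 / (s + la)) {                    # case 1: theta_jj > t_jj
    solve_quad(s + la - lam2 * t)
  } else if (la < s && t > 1 / (s - la)) {   # case 3: theta_jj < t_jj
    solve_quad(s - la - lam2 * t)
  } else {                                   # case 2: theta_jj = t_jj
    t
  }
}
# e.g. theta_diag_special(s = 1, t = 0, lambda = 0.2, alpha = 0.5) recovers
# the zero-target (case 1) solution; w_jj is then 1 / theta_jj.
```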
General Case. Note that in the general case, where the off-diagonal entries of S are no longer all smaller than or equal to λα, the special case still provides some intuition: for small values in the target, case 1 applies, and the larger the entries of the target get, the more likely case 3 is the suitable one. As we assume that the target matrix is diagonal, and hence has non-zero elements only on the diagonal, the core of the gelnet algorithm stays the same. The changes to be made are the following:
• Initialize such that both W and Θ are positive semi-definite and preferably the suitable diagonal case is chosen.
• After solving the quadratic programs, update such that the diagonal entries can switch cases if needed.
Updates. The updates of ŵ_22 and θ̂_22 (Steps 2d and 2f in Algorithm 3) differ from the gelnet algorithm without a target. To ensure that ŵ_22 and θ̂_22 fall into the same case, a simultaneous update of both is needed. In practice, the following updates work for realistic targets, but may not converge for targets with very large diagonal entries. However, such targets with overly large diagonal entries are typically not desirable from a statistical perspective anyway.

Using the relation from Step 2d in Algorithm 3, w_22 can be expressed as w_22 = 1/θ_22 + w_12^T β. Substituting this expression into the diagonal normal equation leads to

$$\lambda\alpha\,\mathrm{sgn}(\theta_{22} - t_{22}) = \frac{1}{\theta_{22}} + w_{12}^T\beta - s_{22} - \lambda(1-\alpha)(\theta_{22} - t_{22}). \qquad (16)$$

Define the function F, which represents λα sgn(θ_22 − t_22), as

$$F(\theta_{22}) := \frac{1}{\theta_{22}} + w_{12}^T\beta - s_{22} - \lambda(1-\alpha)(\theta_{22} - t_{22}).$$

Note that the relevant range of values for F is [−λα, λα]. Define for θ_22 = t_22 the function value F_t as

$$F_t := F(t_{22}) = \frac{1}{t_{22}} + w_{12}^T\beta - s_{22}.$$

Since F is strictly decreasing in θ_22 > 0, we have that F(θ_22) < F_t for θ_22 > t_22 and F(θ_22) > F_t for θ_22 < t_22. By the representation above, we need F(θ_22) = λα for θ_22 > t_22 and F(θ_22) = −λα for θ_22 < t_22. Considering F_t ∈ [−λα, λα], we show by contradiction that then θ_22 = t_22. Assume θ_22 > t_22; then λα = F(θ_22) < F_t ≤ λα, a contradiction. On the other hand, if θ_22 < t_22, then −λα = F(θ_22) > F_t ≥ −λα, again a contradiction. By similar arguments one can show that if F_t > λα, we have θ_22 > t_22, and lastly, if F_t < −λα, we have θ_22 < t_22. Therefore, the value of F_t determines the case to be considered and can be seen as a "test". We need to solve equation (16) for θ_22 depending on the chosen case.

Case 1: Solve the following for θ_22:
$$0 = \lambda(1-\alpha)\theta_{22}^2 + \big(s_{22} + \lambda\alpha - \lambda(1-\alpha)t_{22} - w_{12}^T\beta\big)\theta_{22} - 1.$$
Then update w_22 = s_22 + λα + λ(1 − α)(θ_22 − t_22).
Case 2: θ_22 = t_22 and w_22 = s_22 + F_t.
Case 3: Solve the following for θ_22:
$$0 = \lambda(1-\alpha)\theta_{22}^2 + \big(s_{22} - \lambda\alpha - \lambda(1-\alpha)t_{22} - w_{12}^T\beta\big)\theta_{22} - 1.$$
Then update w_22 = s_22 − λα + λ(1 − α)(θ_22 − t_22).
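A compact R sketch of this simultaneous update (our own helper, not the package code; again α < 1 is assumed) could look as follows. Note that for t_22 = 0, F_t evaluates to Inf in R, so the test correctly selects case 1:

```r
# Simultaneous update of theta_22 and w_22 given wb = w_12' beta,
# following the F_t test and the case-wise quadratic equations above.
update_diag <- function(s, t, wb, lambda, alpha) {
  stopifnot(lambda > 0, alpha < 1, t >= 0)
  la   <- lambda * alpha
  lam2 <- lambda * (1 - alpha)
  Ft   <- 1 / t + wb - s                      # "test" value F_t = F(t_22)
  solve_quad <- function(b) (-b + sqrt(b^2 + 4 * lam2)) / (2 * lam2)
  if (Ft > la) {                              # case 1: theta_22 > t_22
    theta <- solve_quad(s + la - lam2 * t - wb)
    w     <- s + la + lam2 * (theta - t)
  } else if (Ft < -la) {                      # case 3: theta_22 < t_22
    theta <- solve_quad(s - la - lam2 * t - wb)
    w     <- s - la + lam2 * (theta - t)
  } else {                                    # case 2: theta_22 = t_22
    theta <- t
    w     <- s + Ft
  }
  list(theta22 = theta, w22 = w)
}
```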
Initialization. Assume α > 0. As in gelnet, the diagonal entries of Θ are produced first, and then W is taken as W = S + λα Γ + λ(1 − α)(Θ − T). This time Γ is the diagonal matrix with γ_ii = 1 if t_ii ∈ [0, 1/(s_ii + λα)) and γ_ii = 0 otherwise. If t_ii ∈ [0, 1/(s_ii + λα)), then define θ_ii as in (15); otherwise set θ_ii = t_ii.
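The initialization can then be sketched as follows, with T_diag holding the diagonal of the target and 0 < α < 1 assumed so that (15) is well defined (names again ours):

```r
# Initialization of Theta and W for Algorithm 5 with a diagonal target.
init_gelnet_target <- function(S, T_diag, lambda, alpha) {
  stopifnot(lambda > 0, alpha > 0, alpha < 1, all(T_diag >= 0))
  p    <- ncol(S)
  s    <- diag(S)
  la   <- lambda * alpha
  lam2 <- lambda * (1 - alpha)
  gamma <- as.numeric(T_diag < 1 / (s + la))   # 1 iff case 1 applies entrywise
  b      <- s + la - lam2 * T_diag             # case-1 quadratic, cf. (15)
  theta0 <- ifelse(gamma == 1,
                   (-b + sqrt(b^2 + 4 * lam2)) / (2 * lam2),
                   T_diag)
  Theta <- diag(theta0, p)
  W     <- S + la * diag(gamma, p) + lam2 * (Theta - diag(T_diag, p))
  list(Theta = Theta, W = W)
}
```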
The Final Algorithm. In the previous paragraphs we discussed the necessary technical modifications of the updates and the initialization of the gelnet algorithm to incorporate diagonal target matrices. These are summarized again in Algorithm 5. Note again that for very large entries in the target matrix (which are usually beyond what is reasonable from a statistical perspective) our chosen updates might face convergence issues when used with the gelnet algorithm. We did not experience convergence issues for any of the targets used in our simulations (see section 2.1). These reasonable targets are "conservative", i.e. they typically do not have overly large entries.
Algorithm 5: gelnet algorithm with diagonal target matrix T
1. Initialize Θ_init and W_init as described in the Initialization paragraph above.
2. Cycle around the columns repeatedly, performing the following steps until convergence:
 a: Rearrange the rows/columns so that the target column is last (implicitly).
 b: Solve the Elastic Net regression problem (9) with coordinate descent to get β̂. As warm start for β, use the solution from the previous round for this row/column.
 c: Update ŵ_12 = W_11 β̂ and ŵ_21 = ŵ_12^T.
 d: Calculate the test value F_t := F(t_22) as in the Updates paragraph to determine the case.
 e: Update θ̂_22 according to the case.
 f: Update θ̂_12 = −θ̂_22 β̂ and θ̂_21 = θ̂_12^T.
 g: Update ŵ_22 according to the case, e.g. for case 1: ŵ_22 = s_22 + λα + λ(1 − α)(θ̂_22 − t_22).
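Putting the pieces together, the outer loop of Algorithm 5 could be organized as in the skeleton below. This is deliberately only a sketch: solve_enet_subproblem() is a hypothetical placeholder (with an assumed signature) for the coordinate-descent solver of problem (9), which is specified earlier in the paper and not repeated here; update_diag() and init_gelnet_target() are the helper sketches above:

```r
# Skeleton of Algorithm 5, assuming 0 < alpha < 1 and a diagonal target.
gelnet_target_sketch <- function(S, T_diag, lambda, alpha,
                                 solve_enet_subproblem,
                                 tol = 1e-4, max_iter = 100) {
  p   <- ncol(S)
  ini <- init_gelnet_target(S, T_diag, lambda, alpha)
  Theta <- ini$Theta
  W     <- ini$W
  B <- matrix(0, p - 1, p)                    # warm starts for beta, per column
  for (iter in seq_len(max_iter)) {
    W_old <- W
    for (j in seq_len(p)) {                   # step a: column j last (implicitly)
      idx  <- setdiff(seq_len(p), j)
      beta <- solve_enet_subproblem(W[idx, idx], S[idx, j], lambda, alpha,
                                    warm = B[, j])           # step b
      B[, j] <- beta
      w12 <- as.vector(W[idx, idx] %*% beta)                 # step c
      W[idx, j] <- w12
      W[j, idx] <- w12
      upd <- update_diag(S[j, j], T_diag[j], sum(w12 * beta),
                         lambda, alpha)                      # steps d, e and g
      Theta[j, j] <- upd$theta22
      W[j, j]     <- upd$w22
      Theta[idx, j] <- -upd$theta22 * beta                   # step f
      Theta[j, idx] <- Theta[idx, j]
    }
    if (max(abs(W - W_old)) < tol) break                     # convergence check
  }
  list(Theta = Theta, W = W)
}
```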