Graphical Elastic Net and Target Matrices: Fast Algorithms and Software for Sparse Precision Matrix Estimation
Solt Kovács, Tobias Ruckstuhl, Helena Obrist, Peter Bühlmann
Seminar for Statistics, ETH Zurich, Switzerland
January 7, 2021
Abstract
We consider the estimation of undirected Gaussian graphical models and inverse covariance matrices in high-dimensional scenarios by penalizing the corresponding precision matrix. While single $L_1$ (Graphical Lasso) and $L_2$ (Graphical Ridge) penalties for the precision matrix have already been studied, we propose the combination of both, yielding an Elastic Net type penalty. We enable additional flexibility by allowing diagonal target matrices for the precision matrix to be included. We generalize existing algorithms for the Graphical Lasso and provide corresponding software with an efficient implementation to facilitate usage for practitioners. Our software borrows computationally favorable parts from a number of existing packages for the Graphical Lasso, leading to an overall fast(er) implementation while at the same time providing much more methodological flexibility.

Keywords:
Gaussian graphical model; GLassoElnetFast R-package; Graphical Lasso; High-correlation; High-dimensional; ROPE; Sparsity.
1 Introduction

The estimation of precision matrices (the inverses of covariance matrices) in high-dimensional settings, where traditional estimators perform poorly or do not even exist (e.g. the inverse of the sample covariance matrix), has developed considerably over the past years. For multivariate Gaussian observations (with mean vector $\mu$ and covariance matrix $\Sigma$), a zero off-diagonal entry of the precision matrix $\Sigma^{-1}$ encodes conditional independence of the two corresponding variables given all the other ones. The resulting conditional independence graph, with variables as nodes and edges for nonzero off-diagonal entries, is called a Gaussian graphical model (GGM, Lauritzen, 1996).

Meinshausen and Bühlmann (2006) proposed a nodewise regression approach (selecting one variable as the response while taking all remaining ones as predictors, and repeating this for each variable) using the Lasso (Tibshirani, 1996). The zero entries in the estimated Lasso regression coefficients serve as an estimate of the graphical model, i.e. the zero pattern, but this does not give a full estimate of the precision matrix. To overcome this problem, Zhou et al. (2011) proposed to use an unpenalized maximum likelihood estimator for the covariance matrix based on the previously estimated graph. Yuan (2010) used the Dantzig selector (Candès and Tao, 2007) for regression instead of the Lasso to get a preliminary estimate, followed by solving a linear program to obtain a symmetric matrix which is close to the preliminary estimate (and even positive definite with high probability).

A different approach for estimating precision matrices was proposed by Yuan and Lin (2007), based on the maximization of an $L_1$-penalized Gaussian log-likelihood over positive definite matrices. The $L_1$-penalty for the precision matrix encourages sparsity, and thus the zero pattern of the estimated precision matrix serves as a direct estimate of the underlying GGM. Early proposals for the (challenging) computation of the estimates were given by Yuan and Lin (2007) and Banerjee et al. (2008), as well as the Graphical Lasso (glasso) of Friedman et al. (2008). The glasso algorithm has been widely used in the past, and thus nowadays the $L_1$-penalized Gaussian log-likelihood precision matrix estimation problem itself is also called the Graphical Lasso (or glasso). Other recent approaches for the computation of the Graphical Lasso estimator include, for example, those of Scheinberg et al. (2010), Hsieh et al. (2013, 2014) and Atchadé et al. (2015). Theoretical properties of the Graphical Lasso estimator were investigated by Yuan and Lin (2007), Rothman et al. (2008), Lam and Fan (2009) and Ravikumar et al. (2011) (up to a small difference of whether to penalize the diagonal elements of the precision matrix).

The Graphical Lasso has been modified and adapted to several more specific scenarios, e.g. in the presence of missing data (Städler and Bühlmann, 2012), change points (Londschien et al., 2020), unknown block structure (Marlin and Murphy, 2009), in joint estimation proposals in the presence of groups sharing some information (Guo et al., 2011; Danaher et al., 2012; Shan and Kim, 2018), with a prior clustering of the variables (Tan et al., 2015), or even for confidence intervals based on a desparsified Graphical Lasso estimator (Jankova and van de Geer, 2015). While $L_1$-penalized estimation for GGMs is very popular, numerous other approaches have been proposed as well, e.g. the
SCAD type penalty (Fan et al., 2009), the CLIME estimator (Cai et al., 2011), the SPACE estimator for partial correlation estimation (Peng et al., 2009), banded estimation (Bickel and Levina, 2008) or shrinkage-type approaches (Ledoit and Wolf, 2004).

Let $X_i \in \mathbb{R}^p$ for $i = 1, \dots, n$ be i.i.d. $p$-dimensional Gaussian random vectors with mean $\mu \in \mathbb{R}^p$ and positive definite covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ with corresponding precision matrix $\Theta = \Sigma^{-1}$. Let $X \in \mathbb{R}^{n \times p}$ be the data matrix comprising the $n$ observations, i.e. $X = (X_1, \dots, X_n)^\top$. Without loss of generality assume that the columns of $X$ are centered, such that the sample covariance matrix is given by $S = X^\top X / n$. Furthermore, for a given matrix $A \in \mathbb{R}^{p \times p}$ let the matrix norms $\|\cdot\|_1$ and $\|\cdot\|_2$ be defined as $\|A\|_1 = \sum_{i,j=1}^p |A_{ij}|$ and $\|A\|_2 = \|A\|_F = \sqrt{\sum_{i,j=1}^p A_{ij}^2} = \sqrt{\operatorname{tr}(A^\top A)}$, and similarly let $\|\cdot\|_1^-$ and $\|\cdot\|_2^-$ be the norms without the diagonal entries, i.e. $\|A\|_1^- = \sum_{i \neq j} |A_{ij}|$ and $\|A\|_2^- = \sqrt{\sum_{i \neq j} A_{ij}^2}$. Let $\operatorname{diag}(A)$ be the diagonal matrix with the diagonal elements of the matrix $A$ as its entries. Finally, denote by $A \succ 0$ that $A$ is positive definite. Consider the following type of estimation problem for the unknown precision matrix $\Theta$ based on the $n$ observations:

$$\hat{\Theta}(\lambda, \alpha, T) = \operatorname*{argmin}_{\Theta \succ 0} \Big\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda \big( \alpha \|\Theta - T\|_1 + \tfrac{1-\alpha}{2} \|\Theta - T\|_2^2 \big) \Big\}, \qquad (1)$$

where $T \in \mathbb{R}^{p \times p}$ is a known (typically diagonal) positive semi-definite target matrix, and $\lambda \geq 0$ and $\alpha \in [0, 1]$ are tuning parameters. One could also penalize only the off-diagonal entries via $\|\Theta - T\|_1^-$ and $\|\Theta - T\|_2^-$, and instead of the scalar $\lambda$ one could further generalize the estimation by allowing entry-wise penalties. For ease of reading, we will focus on the more restrictive formulation in equation (1) throughout the paper, but we enable the option of entry-wise penalties in our software. Note that due to the $L_2$-penalty term, the scaling of the variables matters. Thus, it matters whether the sample covariance or the sample correlation matrix is provided as the input $S$. The latter would be recommended in practice in most scenarios.

The formulation (1) encompasses several previously proposed high-dimensional precision matrix estimators. The most classical one is the Graphical Lasso estimator (Friedman et al., 2008), occurring in the case of $\alpha = 1$ and the target matrix being the zero matrix. The $L_2$-penalty (i.e., $\alpha = 0$) was recently proposed independently by van Wieringen and Peeters (2016) and Kuismin et al. (2017). The latter authors refer to it as the Ridge type operator for precision matrix estimation (rope) with different target matrices, while van Wieringen and Peeters (2016) called it the Alternative Ridge Precision estimator, with Type I for the case of a zero target matrix and Type II for a nonzero positive definite target matrix. For the $L_2$-penalized estimators, closed form solutions exist (even with target matrices), unlike for the Graphical Lasso, where various iterative procedures for the computations have been proposed, as mentioned earlier. The only proposal we are aware of incorporating target matrices for the $L_1$-penalty is that of van Wieringen (2019). He iteratively applies his generalized Ridge estimator (with elementwise differing $\lambda$ values) to approximate the loss function for the $L_1$ case. However, when aiming for reasonably accurate estimates, this approach is computationally not attractive even for only moderately sized problems. Now let us mention the combination of $L_1$ and $L_2$-penalties, i.e., the case $\alpha \in (0, 1)$ and $T = 0$.
Genevera Allen in her 2010 Stanford University PhD thesis (Allen, 2010, Algorithm 2) adapted the glasso algorithm of Friedman et al. (2008), similar to our proposal in section 3.2. Atchadé et al. (2015) proposed stochastic proximal optimization methods to obtain near-optimal (i.e., approximate) solutions for regularized precision matrix estimation, and their approach also incorporates Elastic Net penalties (without target). They focus on large-scale problems, where the computation of exact solutions becomes impractical. Lastly, Rothman (2012) considers a sparse covariance matrix estimator, and his proposed algorithm bears resemblance to our algorithms for Elastic Net type penalties. We are not aware of publicly available software corresponding to these proposals with Elastic Net penalties; moreover, efficient software for target matrices is limited to the Ridge penalty.

We focus on the estimation problem (1). Elastic Net type penalties (for Gaussian log-likelihood based precision matrix estimation) could provide both the advantages of $L_2$-penalized estimation (as presented recently by van Wieringen and Peeters (2016) and Kuismin et al. (2017)) as well as the benefits of $L_1$-penalization, such as sparse precision matrix estimates that are desired for graphical models. Moreover, the inclusion of suitable target matrices could considerably improve estimation results. However, as discussed previously, only special cases have been treated so far in the literature, both in terms of available algorithms as well as corresponding software. Our goal is to contribute to computational approaches for the general problem (1), i.e., to develop new algorithms that allow estimation in more flexible ways, and also to provide software with efficient implementations of these proposals for practitioners. A practice oriented introduction to targets, a motivating real data example and some code snippets showing the usage of the package in section 2 are aimed to facilitate understanding and usage.

Our algorithms build upon the glasso algorithm of Friedman et al. (2008) and the more recent dpglasso algorithm of Mazumder and Hastie (2012c) that offers certain advantages over its predecessor. We describe these two algorithms (for the case without target matrices) in section 3.1. We then introduce our modifications in section 3.2 that lead to our gelnet (Graphical Elastic Net) and dpgelnet algorithms. In section 3.3 we further generalize these approaches to include diagonal target matrices. While diagonal matrices are less flexible than arbitrary positive semi-definite target matrices, we note that in practice many fall into this category (e.g. all target matrices that were used in simulations by van Wieringen and Peeters (2016) and by Kuismin et al. (2017)). Our developed algorithms are supported by mathematical derivations and theory. We present simulations in section 4, comparing our new estimators with Elastic Net penalties and target matrices to the plain Graphical Lasso and also to recent Ridge-type approaches. Competitive computational performance is demonstrated in section 5 by benchmarking our software against widely used packages such as the glasso (Friedman et al., 2019), glassoFast (Sustik et al., 2018), and dpglasso (Mazumder and Hastie, 2012a) R-packages in terms of computational times.

2 Targets, a real data example and software usage

2.1 Target types

We present some target types that can be used within equation (1) and which were used in the simulations in section 4.

• True Diagonal: Taking the diagonal of the underlying true precision matrix. Of course, using this target is only possible in simulation settings, where the original precision matrix is known.
• Identity: Taking the identity matrix as target. This choice of target is very conservative (i.e., having rather small entries) when the input $S$ in problem (1) is the empirical correlation matrix.
• $v$-Identity: Multiplying the identity matrix with a scalar $v$, where $v$ is the inverse of the mean of all diagonal entries in the sample covariance matrix (see Kuismin et al., 2017).
• Eigenvalue: The "DAIE" target type from the default.target function of the rags2ridges R-package (Peeters et al., 2020). This takes a diagonal matrix with the average of the inverse nonzero eigenvalues of the sample covariance matrix as entries. Eigenvalues under a certain threshold are set to zero.
• Maximal Single Correlation: For variable $j$, take from the remaining variables the one that has the highest absolute correlation with $j$ and denote it by $k$. The corresponding correlation is denoted as $\rho_{jk}$. Then set $T_{jj} = ((1 - |\rho_{jk}|) \cdot S_{jj})^{-1}$, where $S_{jj}$ is the $j$-th diagonal element of the sample covariance matrix $S$.
• Nodewise Regression (Meinshausen and Bühlmann, 2006): The idea is similar to the maximal single correlation, but we allow for more than one predictor. For a variable $j$, apply 10-fold cross-validation using Lasso regression (using all variables other than the $j$-th one as predictors). Then set the $j$-th diagonal entry of the target matrix as the inverse of the minimal cross-validated error variance. Under suitable conditions, the target matrix diagonals converge to the diagonal values of the true precision matrix $\Theta$.

Some further targets are implemented in the default.target function of the rags2ridges R-package of Peeters et al. (2020). As an illustration, a minimal sketch of the Maximal Single Correlation target computation is given below.
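The following is a minimal R sketch of the Maximal Single Correlation target described above; the function name mscTarget is ours for illustration only (the GLassoElnetFast package provides its own target function):

```r
# Minimal sketch of the Maximal Single Correlation target (illustration only;
# the GLassoElnetFast package provides its own `target` function).
mscTarget <- function(X) {
  S <- cov(X)                      # sample covariance matrix
  R <- cor(X)                      # sample correlation matrix
  p <- ncol(X)
  t.diag <- numeric(p)
  for (j in 1:p) {
    rho.jk <- max(abs(R[j, -j]))   # highest absolute correlation with variable j
    t.diag[j] <- 1 / ((1 - rho.jk) * S[j, j])
  }
  diag(t.diag, p)                  # diagonal target matrix T
}
```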
2.2 A real data example

In order to illustrate differences between the methodologies, we apply glasso and gelnet (with $\alpha = 0.5$) to gene expression data of $p = 39$ isoprenoid genes from $n = 118$ samples of the plant Arabidopsis thaliana. As there is no solid ground truth available, we only briefly show that different estimation approaches lead to different results, and thus the additional flexibility provided by target matrices and Elastic Net type penalties provides additional options for real data applications compared to the glasso. In all examples below, the penalization parameter for each algorithm is chosen such that the resulting precision matrix has only 200 nonzero off-diagonal entries, which leads to 100 edges between the genes. In the first example (Figure 1) we take the sample correlation matrix of the genes and apply both glasso and gelnet (with $\alpha = 0.5$) without target matrix (i.e. $T = 0$).

Figure 1: Original data and no target is used (i.e. $T = 0$). The 100 edges chosen by glasso (on the left) and gelnet with $\alpha = 0.5$ (on the right).

In the second example we adjust for the first principal component and then apply the same procedure to the new sample correlation matrix. Such an adjustment is common to remove potential hidden confounding variables. As shown on the green right plot of Figure 2, in this case fewer edges are in common.

Figure 2: Data adjusted for the first principal component and no target is used. The 100 edges chosen by glasso (on the left) and gelnet with $\alpha = 0.5$ (on the right).

Instead of adjusting for the first principal component, in the third example we use the Maximal Single Correlation target from subsection 2.1. As shown on the green right plot of Figure 3, in this case even fewer edges are in common.

Figure 3: Original data and the Maximal Single Correlation target is used. The 100 edges chosen by glasso (on the left) and gelnet with $\alpha = 0.5$ (on the right).

2.3 R-package GLassoElnetFast
In the following we provide a few examples on how to use our software.
Installing the GLassoElnetFast R-package. This can be done as follows.
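A minimal sketch of the installation, assuming the package is obtained from CRAN once available (the paper states that a CRAN release is planned) or from a development repository; the repository path below is a placeholder, not a confirmed location:

```r
# install.packages("GLassoElnetFast")                  # once the package is on CRAN
# devtools::install_github("<user>/GLassoElnetFast")   # development version (path is a placeholder)
library(GLassoElnetFast)
```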
Estimation with no target matrices. We show here the options for the Graphical Lasso, the Graphical Elastic Net and the Rope method (Ridge penalty).
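A minimal sketch, assuming $S$ is a sample correlation matrix and that the mixing parameter is passed as alpha (argument names other than lambda, Theta and W, which are documented below, are assumptions):

```r
S <- cor(X)   # sample correlation matrix of the data X

# Graphical Lasso: alpha = 1, no target
fit.glasso <- gelnet(S, lambda = 0.1, alpha = 1)

# Graphical Elastic Net: 0 < alpha < 1, no target
fit.gelnet <- gelnet(S, lambda = 0.1, alpha = 0.5)

# Rope (Ridge penalty): alpha = 0, no target
fit.rope <- gelnet(S, lambda = 0.1, alpha = 0)

fit.gelnet$Theta   # estimated precision matrix
```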
Estimation with the identity target matrix. We show here the options for the Graphical Lasso, the Graphical Elastic Net and the Rope method (Ridge penalty).
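A minimal sketch with the identity matrix as target, passed via the documented Target argument:

```r
p <- ncol(S)
T.id <- diag(p)   # identity target matrix

fit.glassoT <- gelnet(S, lambda = 0.1, alpha = 1,   Target = T.id)
fit.gelnetT <- gelnet(S, lambda = 0.1, alpha = 0.5, Target = T.id)
fit.ropeT   <- gelnet(S, lambda = 0.1, alpha = 0,   Target = T.id)
```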
Estimation with the identity target matrix and using cross-validation. We show here the options for the Graphical Elastic Net.
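A hand-rolled cross-validation sketch (we do not assume a built-in CV helper here); the held-out negative Gaussian log-likelihood $-\log\det\hat\Theta + \operatorname{tr}(S_{\text{test}}\hat\Theta)$ serves as the score:

```r
# 5-fold cross-validation over a lambda grid for the Graphical Elastic Net
# with identity target (illustration only).
cv.gelnet <- function(X, lambdas, alpha = 0.5, Target = diag(ncol(X)), K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(X)))
  score <- matrix(NA, K, length(lambdas))
  for (k in 1:K) {
    S.train <- cor(X[folds != k, ])
    S.test  <- cor(X[folds == k, ])
    for (l in seq_along(lambdas)) {
      Theta <- gelnet(S.train, lambda = lambdas[l], alpha = alpha,
                      Target = Target)$Theta
      # held-out negative log-likelihood: -log det(Theta) + tr(S.test Theta)
      score[k, l] <- -determinant(Theta)$modulus + sum(S.test * Theta)
    }
  }
  lambdas[which.min(colMeans(score))]   # lambda with smallest CV score
}
```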
Further remarks about the gelnet and dpgelnet functions. They include the following:
• Instead of a scalar $\lambda$, one can also provide entry-wise penalties via the argument lambda (which needs to be a vector or a matrix in this case).
• Besides the estimated precision matrix (Theta), an estimate of the covariance matrix (W) is also returned. This can be accessed by e.g. fitGelnet$W.
• One can choose whether to penalize the diagonal via the penalize.diagonal argument.
• One can provide warm starts (via the arguments Theta and W).
• niter, del and conv are also part of the output, yielding information about the number of iterations, the change in parameter value and convergence (TRUE or FALSE) with the set number of iterations and thresholds. One can set various thresholds (arguments outer.thr and inner.thr) and iteration numbers (arguments outer.maxit and inner.maxit) when aiming for more accurate results or in case algorithms face convergence issues.

Other notable arguments for the gelnet function:
• The argument zero allows to specify indices of the precision matrix which are constrained to be zero.
• The argument Target allows to specify a diagonal target matrix. Target matrices mentioned in this paper are implemented via the function target, and some further ones are implemented in the default.target function of the rags2ridges R-package.

Further explanations can be obtained in the help files, e.g. help(gelnet). A short sketch combining some of these arguments is given below.
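A minimal sketch combining entry-wise penalties and zero constraints; the two-column index format passed to zero is our assumption (the exact format expected is documented in the help files):

```r
p <- ncol(S)
Lambda <- matrix(0.1, p, p)             # entry-wise penalties
Lambda[1, 2] <- Lambda[2, 1] <- 0.5     # penalize one pair more strongly

zero.idx <- rbind(c(1, 3), c(3, 1))     # constrain Theta[1,3] = Theta[3,1] = 0 (format assumed)

fit <- gelnet(S, lambda = Lambda, alpha = 0.5,
              Target = diag(p), zero = zero.idx,
              penalize.diagonal = FALSE)
```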
Our implementations in the GLassoElnetFast R-package build upon the glasso R-package (Friedman et al., 2019) and the dpglasso R-package (Mazumder and Hastie, 2012a) by suitably modifying them. Additionally, we also utilized a faster implementation from the glassoFast R-package (Sustik et al., 2018; Sustik and Calderhead, 2012). Moreover, whenever possible, we even improve on the original versions. For example, to achieve further speed-ups, we incorporate block diagonal screening (Mazumder and Hastie, 2012b; Witten et al., 2011), which was missing in the dpglasso and glassoFast packages. Compared to the dpglasso package, our implementation relies even more on Fortran, which is beneficial for its speed, while for the estimation we also allow the $\|\cdot\|_1^-$ penalty (instead of $\|\cdot\|_1$ only) and entry-wise penalties. Besides such technical improvements on existing software (which lead to some computational speed-up as shown in section 5), the main new features are the additional flexibility of allowing Elastic Net penalties and diagonal target matrices, as described earlier. Our implementation is planned to be made available as an R-package on CRAN.

3 From glasso and dpglasso to gelnet and dpgelnet

3.1 glasso and dpglasso

In this section we briefly present the glasso algorithm by Friedman et al. (2008) as well as the dpglasso algorithm by Mazumder and Hastie (2012c), and set up some notation which we will rely on when presenting in the next subsection 3.2 the modified versions with the Elastic Net penalties, called gelnet and dpgelnet. We closely follow the derivation presented in Mazumder and Hastie (2012c). Both the glasso and the dpglasso algorithm seek to minimize the $L_1$-regularized negative log-likelihood over all positive definite precision matrices $\Theta$:

$$\hat\Theta(\lambda, \alpha = 1, T = 0) = \operatorname*{argmin}_{\Theta \succ 0} \{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\|\Theta\|_1 \}.$$

We call the function inside the argmin the
Graphical Lasso loss, which is a special case of (1). We assume the tuning parameter $\lambda \geq 0$ to be given; $S$, required as the input, is usually the empirical covariance or correlation matrix. The algorithms work with the normal equations corresponding to the Graphical Lasso loss:

$$-\Theta^{-1} + S + \lambda\Gamma = 0, \qquad -W + S + \lambda\Gamma = 0, \qquad (2)$$

where $W := \Theta^{-1}$ is called the working covariance matrix and the matrix $\Gamma$ denotes the component-wise signs of $\Theta$, with $\Gamma_{i,j} \in [-1, 1]$ for $\Theta_{i,j} = 0$. The approach used for solving the normal equations is to update one dimension (row and column) at a time while leaving the remaining ones fixed. The update requires solving a suitable penalized regression problem, which can be performed efficiently by applying coordinate descent (Friedman et al., 2007). After updating, one proceeds with the next dimension (row and column), until all dimensions have been updated once. Typically one repeats these update cycles for all the variables as long as differences are above a certain threshold used as a stopping criterion.

Notation 1.
Partition $S$, $\Theta$, $W$ and $\Gamma$ into block form as follows: a matrix with dimensions $(p-1) \times (p-1)$ (e.g. $S_{11}$), two vectors of length $p-1$ (e.g. $s_{12}$) and a scalar (e.g. $s_{22}$):

$$S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{pmatrix}, \quad \Theta = \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^\top & \theta_{22} \end{pmatrix}, \quad W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{pmatrix}, \quad \Gamma = \begin{pmatrix} \Gamma_{11} & \gamma_{12} \\ \gamma_{12}^\top & \gamma_{22} \end{pmatrix}. \qquad (3)$$

The main difference between glasso and dpglasso lies in the fact that glasso actively works with $W$, while dpglasso works with its inverse $\Theta$. Furthermore, glasso and dpglasso use different block-matrix representations for $W$, where properties of inverses of block-partitioned matrices are used together with $\Theta = W^{-1}$:

$$W_{\text{glasso}} = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{pmatrix} = \begin{pmatrix} (\Theta_{11} - \theta_{12}\theta_{12}^\top/\theta_{22})^{-1} & -W_{11}\theta_{12}/\theta_{22} \\ -\theta_{12}^\top W_{11}/\theta_{22} & 1/\theta_{22} + \theta_{12}^\top W_{11}\theta_{12}/\theta_{22}^2 \end{pmatrix}, \qquad (4)$$

$$W_{\text{dpglasso}} = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{pmatrix} = \begin{pmatrix} \Theta_{11}^{-1} + \dfrac{\Theta_{11}^{-1}\theta_{12}\theta_{12}^\top\Theta_{11}^{-1}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} & \dfrac{-\Theta_{11}^{-1}\theta_{12}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} \\ \dfrac{-\theta_{12}^\top\Theta_{11}^{-1}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} & \dfrac{1}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} \end{pmatrix}. \qquad (5)$$

As the remaining steps of the derivation (e.g. the regression problems involved in the row and column updates as well as the coordinate descent procedure to solve them) are special cases of what is discussed in the next subsection 3.2, we only present the final algorithms here. Note that in Algorithm 1 (and Algorithm 2) $I$ denotes the identity matrix (of dimension $p \times p$).

Algorithm 1 (glasso)
Initialize $W_{\text{init}} = S + \lambda I$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve the following Lasso problem with coordinate descent to get $\hat\beta$:
$$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p-1}} \{ \tfrac{1}{2}\beta^\top W_{11}\beta - \beta^\top s_{12} + \lambda\|\beta\|_1 \}.$$
As warm start for $\beta$ use the solution from the previous round for this row/column.
c: Update the off-diagonal of the working covariance matrix as $\hat w_{12} = W_{11}\hat\beta$ (and similarly for $\hat w_{21}$), but do not change the diagonal entry $w_{22}$.
d: Save $\hat\beta$ for this row/column in a matrix $B$.
Finally, for every row/column, compute the diagonal entries of $\Theta$ using $\hat\theta_{22} = 1/(w_{22} - \hat\beta^\top\hat w_{12})$ and obtain the off-diagonal entries of $\Theta$ from the matrix $B$, where $\hat\theta_{12} = -\hat\theta_{22}\hat\beta$ and $\hat\theta_{21} = \hat\theta_{12}^\top$.

Algorithm 2 (dpglasso)
Initialize $\Theta_{\text{init}} = \operatorname{diag}(S + \lambda I)^{-1}$ and $W_{\text{init}} = S + \lambda I$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve the following quadratic program with coordinate descent for $\gamma \in \mathbb{R}^{p-1}$:
$$\hat\gamma = \operatorname*{argmin}_{\|\gamma\|_\infty \leq \lambda} \{ (s_{12} + \gamma)^\top \Theta_{11} (s_{12} + \gamma) \}.$$
c: Update $\hat\theta_{12} = -\Theta_{11}(s_{12} + \hat\gamma)/w_{22}$ and $\hat\theta_{21} = \hat\theta_{12}^\top$.
d: Update $\hat\theta_{22} = (1 - (s_{12} + \hat\gamma)^\top\hat\theta_{12})/w_{22}$.
e: Update $\hat w_{12} = s_{12} + \hat\gamma$ and $\hat w_{21} = \hat w_{12}^\top$, but do not change the diagonal entry $w_{22}$.

If the primary interest lies in the estimation of $\Theta$, the dpglasso algorithm offers several advantages. It yields both a sparse and positive definite $\Theta$, while glasso only ensures a sparse solution, which is not necessarily positive definite in all cycles of the algorithm and thus, in rare cases, also not when stopping. Moreover, dpglasso converges with all positive definite warm starts $\Theta_{\text{ws}}$. glasso is guaranteed to maintain positive definite updates in each step only if the warm start $W_{\text{ws}}$ fulfills the conditions $W_{\text{ws}} \succ 0$ and $\|W_{\text{ws}} - S\|_\infty \leq \lambda$ (which is the case for example for the standard initialization $W_{\text{init}} = S + \lambda I$). For detailed results and differences see Mazumder and Hastie (2012c).

3.2 gelnet and dpgelnet

In this section the modified versions of the glasso and dpglasso algorithms, named gelnet and dpgelnet, are derived. The specific implementations of gelnet and dpgelnet utilize the shell of the glasso code provided by Friedman et al. (2019) and the dpglasso code provided by Mazumder and Hastie (2012a), see section 2.3. Penalizing the negative log-likelihood with a combination of $L_1$ and $L_2$ terms leads to the following special case of problem (1):

$$\hat\Theta(\lambda, \alpha, T = 0) = \operatorname*{argmin}_{\Theta \succ 0} \Big\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\big(\alpha\|\Theta\|_1 + \tfrac{1-\alpha}{2}\|\Theta\|_2^2\big) \Big\}, \qquad (6)$$

where $\lambda$ is a non-negative tuning parameter, $\alpha \in [0, 1]$ is another tuning parameter, and $S$ is a positive semi-definite matrix (typically the covariance or correlation matrix). Minimizing expression (6) is called the Graphical Elastic Net or, in short, the gelnet problem. The corresponding normal equations are

$$-\Theta^{-1} + S + \lambda\alpha\Gamma + \lambda(1-\alpha)\Theta = 0, \qquad -W + S + \lambda\alpha\Gamma + \lambda(1-\alpha)\Theta = 0, \qquad (7)$$

where $\Gamma$ is the sign matrix of $\Theta$ and $W$ is the working covariance matrix, similar to the previous notation from (2). The gelnet and dpgelnet algorithms follow the block coordinate descent approach of the glasso and dpglasso algorithms in solving the normal equations (7): update one row/column at a time while leaving the remaining ones fixed. The update requires solving another optimization problem (a regression problem), which can be performed efficiently by applying coordinate descent (Friedman et al., 2007). After updating, one proceeds with the next row/column update, until all of them have been updated once. Typically one repeats these update cycles for all the variables as long as differences are above the threshold used as the stopping criterion. Both algorithms work actively with $\Theta$ and $W$ simultaneously, in contrast to glasso and dpglasso. The detailed derivation of gelnet and dpgelnet can be split into two problems:

• Problem 1) Derive the formula for the row/column updates as well as the optimization problem involved therein.
• Problem 2) Apply coordinate descent for the optimization from Problem 1).

Coordinate descent (Tseng, 2001; Friedman et al., 2007) performs componentwise updates leaving the other coordinates fixed. If the function to be minimized is convex and can be decomposed into a differentiable, convex function plus a sum of convex functions of each individual parameter, then iteratively updating each coordinate is guaranteed to converge to the global minimizer. For more details, see Tseng (2001), Friedman et al. (2007) or Hastie et al. (2015). The functions to be minimized at each of the row/column updates of gelnet and dpgelnet fulfill this property, and hence only the componentwise updates have to be derived.

gelnet

Solution to Problem 1)
Consider the $p$-th row of the normal equations in (7) without the diagonal entry, using the notation from (3):

$$-w_{12} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0.$$

Use (4) for $w_{12}$ and define $\beta := -\theta_{12}/\theta_{22}$:

$$W_{11}\theta_{12}/\theta_{22} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0,$$
$$W_{11}\beta - s_{12} - \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{22}\beta = 0. \qquad (8)$$

Note that $\gamma_{12} \in -\operatorname{sign}(\beta)$ since $\theta_{22} > 0$. Therefore (8) corresponds to the normal equation of the following $L_1$- and $L_2$-regularized quadratic program:

$$\operatorname*{minimize}_{\beta \in \mathbb{R}^{p-1}} \Big\{ \tfrac{1}{2}\beta^\top W_{11}\beta - \beta^\top s_{12} + \lambda\alpha\|\beta\|_1 + \lambda\tfrac{1-\alpha}{2}\theta_{22}\,\beta^\top\beta \Big\},$$

or equivalently:

$$\operatorname*{minimize}_{\beta \in \mathbb{R}^{p-1}} \Big\{ \tfrac{1}{2}\big\|W_{11}^{1/2}\beta - W_{11}^{-1/2}s_{12}\big\|_2^2 + \lambda\alpha\|\beta\|_1 + \lambda\tfrac{1-\alpha}{2}\theta_{22}\|\beta\|_2^2 \Big\}. \qquad (9)$$

After solving the quadratic program (see Problem 2) below) with minimizer $\hat\beta$, update the entries as follows:
• $\hat w_{12} = -W_{11}\theta_{12}/\theta_{22} = W_{11}\hat\beta$, using the representation (4) and $\hat\beta = -\theta_{12}/\theta_{22}$
• $\hat\theta_{22} = 1/(w_{22} - \hat\beta^\top\hat w_{12})$, by using (4), $\hat\beta = -\theta_{12}/\theta_{22}$ and $\hat w_{12} = W_{11}\hat\beta$
• $\hat\theta_{12} = -\hat\theta_{22}\hat\beta$
• $\hat w_{22}$ via the normal equations (7)

Solution to Problem 2)
We show here how to apply componentwise updates for solving the $L_1$- and $L_2$-regularized quadratic program (9). Using the notation $Z = W_{11}^{1/2}$, $y = W_{11}^{-1/2}s_{12}$, $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1-\alpha)\theta_{22}$, the quadratic program translates to a standard Elastic Net regression problem, i.e.,

$$R(\beta) := \tfrac{1}{2}\|y - Z\beta\|_2^2 + \lambda_1\|\beta\|_1 + \tfrac{\lambda_2}{2}\|\beta\|_2^2$$

needs to be minimized over $\beta \in \mathbb{R}^{p-1}$. Let $\tilde\beta_k$ for $k \neq j$ be estimates and partially optimize $R(\beta)$ with respect to $\beta_j$ by computing the gradient at $\beta_j = \tilde\beta_j$. The gradient only exists if $\tilde\beta_j \neq 0$. Without loss of generality assume that $\tilde\beta_j > 0$. Then:

$$\frac{\partial R}{\partial \beta_j}\bigg|_{\beta = \tilde\beta} = -\sum_{i=1}^{p-1} Z_{i,j}\,(y_i - Z_i^\top\tilde\beta) + \lambda_1 + \lambda_2\tilde\beta_j.$$

With the partial residuals given as $r_i^{(j)} = y_i - \sum_{k \neq j} Z_{ik}\tilde\beta_k$, setting the partial derivative to 0 yields:

$$\tilde\beta_j = \frac{\sum_{i=1}^{p-1} Z_{i,j}\, r_i^{(j)} - \lambda_1}{\sum_{i=1}^{p-1} Z_{i,j}^2 + \lambda_2}.$$

A similar derivation can be done for $\tilde\beta_j < 0$. The case $\tilde\beta_j = 0$ is treated separately using standard sub-differential calculus. The overall solution for $\tilde\beta_j$ satisfies:

$$\tilde\beta_j = \frac{S_{\lambda_1}\big(\sum_{i=1}^{p-1} Z_{i,j}\, r_i^{(j)}\big)}{\sum_{i=1}^{p-1} Z_{i,j}^2 + \lambda_2}, \qquad (10)$$

where $S_{\lambda_1}(x) = \operatorname{sign}(x)(|x| - \lambda_1)_+$ is the soft-thresholding operator. Using the inner products $Z^\top Z = W_{11}$ and $Z^\top y = s_{12}$, equation (10) can be written as

$$\tilde\beta_j = \frac{S_{\lambda_1}\big((s_{12})_j - \sum_{k \neq j} (W_{11})_{k,j}\tilde\beta_k\big)}{(W_{11})_{j,j} + \lambda_2}.$$

A minimal R sketch of this coordinate update is given below.
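The following is a minimal R sketch of the inner coordinate descent loop, directly transcribing the update formula above (illustration only; the actual package implementation is in Fortran):

```r
# One full sweep of coordinate descent for the Elastic Net subproblem (9).
# W11: (p-1)x(p-1) block, s12: length p-1 vector, beta: current estimates.
soft <- function(x, lam) sign(x) * pmax(abs(x) - lam, 0)   # soft-thresholding

cd.sweep <- function(beta, W11, s12, lambda1, lambda2) {
  for (j in seq_along(beta)) {
    # (s12)_j - sum_{k != j} (W11)_{k,j} * beta_k
    num <- s12[j] - sum(W11[-j, j] * beta[-j])
    beta[j] <- soft(num, lambda1) / (W11[j, j] + lambda2)
  }
  beta
}
```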
Putting all these pieces from Problem 1) and Problem 2) together yields the gelnet algorithm (Algorithm 3 below).

Algorithm 3 (gelnet)
Initialize $\Theta_{\text{init}} = \operatorname{diag}(S + \lambda\alpha I)^{-1}$ and $W_{\text{init}} = S + \lambda\alpha I + \lambda(1-\alpha)\Theta_{\text{init}}$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve the Elastic Net regression problem (9) with coordinate descent to get $\hat\beta$. As warm start for $\beta$ use the solution from the previous round for this row/column.
c: Update $\hat w_{12} = W_{11}\hat\beta$, $\hat w_{21} = \hat w_{12}^\top$.
d: Update $\hat\theta_{22} = 1/(w_{22} - \hat\beta^\top\hat w_{12})$.
e: Update $\hat\theta_{12} = -\hat\theta_{22}\hat\beta$, $\hat\theta_{21} = \hat\theta_{12}^\top$.
f: Update $\hat w_{22} = s_{22} + \lambda\alpha + \lambda(1-\alpha)\hat\theta_{22}$.

dpgelnet

Solution to Problem 1)
Consider the $p$-th row of the normal equations in (7) without the diagonal entry, using the notation from (3):

$$-w_{12} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0.$$

Use (5) for $w_{12}$ and $w_{22}$, and then multiply with $\Theta_{11}$ from the left to obtain

$$\frac{\Theta_{11}^{-1}\theta_{12}}{\theta_{22} - \theta_{12}^\top\Theta_{11}^{-1}\theta_{12}} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0,$$
$$\Theta_{11}^{-1}w_{22}\,\theta_{12} + s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12} = 0,$$
$$w_{22}\,\theta_{12} + \Theta_{11}\big(s_{12} + \lambda\alpha\gamma_{12} + \lambda(1-\alpha)\theta_{12}\big) = 0. \qquad (11)$$

Consider the case with $\alpha \in (0, 1]$. Substitute $\tilde\gamma := \lambda\alpha\gamma_{12}$ as well as $\tilde q := \operatorname{abs}(\theta_{12})$ (so that $\theta_{12} = \tilde q * \tilde\gamma/(\lambda\alpha)$) and get the following Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{w_{22}}{\lambda\alpha}\,\tilde q * \tilde\gamma + \Theta_{11}\Big(s_{12} + \big(1 + \tfrac{1-\alpha}{\alpha}\tilde q\big) * \tilde\gamma\Big) = 0, \qquad \tilde q * \big(\operatorname{abs}(\tilde\gamma) - \lambda\alpha\,\mathbf{1}_{p-1}\big) = 0, \qquad \|\tilde\gamma\|_\infty \leq \lambda\alpha, \qquad (12)$$

where $*$ denotes elementwise multiplication. These are equivalent to the following box-constrained quadratic program for $\gamma \in \mathbb{R}^{p-1}$:

$$\operatorname*{minimize}_{\|\gamma\|_\infty \leq \lambda\alpha} \Big\{ \big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma\big)^\top \Theta_{11} \big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma\big) \Big\}. \qquad (13)$$

In the special case where $\alpha = 0$, the normal equations (11) simplify to:

$$w_{22}\,\theta_{12} + \Theta_{11}(s_{12} + \lambda\theta_{12}) = 0.$$

Substitute $\tilde q = \operatorname{abs}(\theta_{12})$ to get the following Karush-Kuhn-Tucker (KKT) conditions:

$$w_{22}\,\tilde q * \gamma + \Theta_{11}(s_{12} + \lambda\,\tilde q * \gamma) = 0, \qquad \tilde q * \big(\operatorname{abs}(\gamma) - \mathbf{1}_{p-1}\big) = 0, \qquad \|\gamma\|_\infty \leq 1.$$

These are equivalent to the following box-constrained quadratic program for $\gamma \in \mathbb{R}^{p-1}$:

$$\operatorname*{minimize}_{\|\gamma\|_\infty \leq 1} (s_{12} + \lambda\,\tilde q * \gamma)^\top \Theta_{11} (s_{12} + \lambda\,\tilde q * \gamma).$$

After solving the quadratic program (see Problem 2) below) with solution $\gamma^*$, update for $\alpha \in (0, 1]$ in the following way (with very similar updates for the case $\alpha = 0$):
• $\hat\theta_{12} = -\Theta_{11}\big(s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)/w_{22}$, using (12) and that $\hat\theta_{12} = \tilde q * \tilde\gamma/(\lambda\alpha)$
• $\hat\theta_{22} = 1/w_{22} + \hat\theta_{12}^\top\Theta_{11}^{-1}\hat\theta_{12}$ by (5), and then use $\hat\theta_{12}$ to get $\hat\theta_{22} = \big(1 - (s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*)^\top\hat\theta_{12}\big)/w_{22}$
• $\hat w_{12}$ and $\hat w_{22}$ via the normal equations (7)

Solution to Problem 2)
We show here how to apply componentwise updates for solving the box-constrained quadratic program (13). Taking the derivative with respect to the $j$-th coordinate of $c$ for a quadratic program of the form $(a + b * c)^\top D (a + b * c)$ leads, up to a factor of 2, to:

$$\tfrac{1}{2}\big[\partial_c\,(a + b*c)^\top D (a + b*c)\big]_{(j)} = (b * (Da))_j + (b * (D(b*c)))_j = (b * (D(a + b*c)))_j = b_j \sum_{k=1}^{p-1} d_{j,k}(a_k + b_k c_k).$$

Setting $b_j \sum_{k=1}^{p-1} d_{j,k}(a_k + b_k c_k) = 0$, one obtains

$$d_{j,j}(a_j + b_j c_j) = -\sum_{k \neq j} d_{j,k}(a_k + b_k c_k), \qquad c_j = \frac{-\sum_{k=1}^{p-1} d_{j,k}(a_k + b_k c_k) + d_{j,j} b_j c_j}{d_{j,j} b_j}.$$

This leads to the following coordinate update of $\gamma_j$ (which is subsequently projected back onto the box, here $[-\lambda\alpha, \lambda\alpha]$, to enforce the constraint):

$$\gamma_{j,\text{new}} = \frac{-\sum_{k=1}^{p-1}(\Theta_{11})_{j,k}\big((s_{12})_k + (1 + \tfrac{1-\alpha}{\alpha}\tilde q)_k\,\gamma_{k,\text{old}}\big) + (\Theta_{11})_{j,j}\,(1 + \tfrac{1-\alpha}{\alpha}\tilde q)_j\,\gamma_{j,\text{old}}}{(\Theta_{11})_{j,j}\,(1 + \tfrac{1-\alpha}{\alpha}\tilde q)_j}.$$

In the special case where $\alpha = 0$,

$$\gamma_{j,\text{new}} = \frac{-\sum_{k=1}^{p-1}(\Theta_{11})_{j,k}\big((s_{12})_k + \lambda\tilde q_k\,\gamma_{k,\text{old}}\big) + (\Theta_{11})_{j,j}\,\lambda\tilde q_j\,\gamma_{j,\text{old}}}{(\Theta_{11})_{j,j}\,\lambda\tilde q_j}.$$

Putting all these pieces from Problems 1) and 2) together yields the dpgelnet algorithm. Algorithm 4 below shows the procedure for $\alpha \in (0, 1]$ (which changes only slightly for the case $\alpha = 0$). A minimal sketch of the box-constrained coordinate update follows the algorithm.

Algorithm 4 (dpgelnet)
Initialize $\Theta_{\text{init}} = \operatorname{diag}(\lambda\alpha I + S)^{-1}$ and $W_{\text{init}} = S + \lambda\alpha I + \lambda(1-\alpha)\Theta_{\text{init}}$.
Cycle around the columns repeatedly, performing the following steps till convergence:
a: Rearrange the rows/columns so that the currently updated column is last (implicitly).
b: Solve (13) and denote the solution by $\gamma^*$.
c: Update $\hat\theta_{12} = -\Theta_{11}\big(s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)/w_{22}$.
d: Update $\hat\theta_{22} = \big(1 - (s_{12} + (1 + \frac{1-\alpha}{\alpha}\tilde q) * \gamma^*)^\top\hat\theta_{12}\big)/w_{22}$.
e: Update $\hat w_{12} = s_{12} + \gamma^* + \lambda(1-\alpha)\hat\theta_{12}$.
f: Update $\hat w_{22} = s_{22} + \lambda\alpha + \lambda(1-\alpha)\hat\theta_{22}$.
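A minimal R sketch of one sweep of the box-constrained coordinate update above; the coordinatewise minimizer of a convex quadratic under an interval constraint is the unconstrained minimizer clipped to the interval (illustration only, not the package's Fortran code):

```r
# One sweep of coordinate descent for the box-constrained QP (13).
# Theta11: (p-1)x(p-1) block, s12: length p-1, qt: abs(theta12), gamma: current value.
qp.sweep <- function(gamma, Theta11, s12, qt, lambda, alpha) {
  b <- 1 + (1 - alpha) / alpha * qt            # elementwise scaling from (13)
  for (j in seq_along(gamma)) {
    # sum_k Theta11[j,k] (s12_k + b_k gamma_k), excluding the j-th own term
    resid <- sum(Theta11[j, ] * (s12 + b * gamma)) -
             Theta11[j, j] * b[j] * gamma[j]
    g <- -resid / (Theta11[j, j] * b[j])       # unconstrained coordinate minimizer
    gamma[j] <- min(max(g, -lambda * alpha), lambda * alpha)   # clip to the box
  }
  gamma
}
```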
Lemma 1. Suppose $\Theta \succ 0$ is used as warm start for the dpgelnet algorithm. Then every row/column update of dpgelnet maintains the positive definiteness of the working precision matrix $\Theta$.

Note that a corresponding lemma for dpglasso is proven by Mazumder and Hastie (2012c), and hence we just provide a simple extension here.

Proof 1.
Let
$$A = \begin{pmatrix} A_{11} & a_{12} \\ a_{12}^\top & a_{22} \end{pmatrix}.$$
The condition $A \succ 0$ is equivalent to:
$$A_{11} \succ 0 \quad \text{and} \quad a_{22} - a_{12}^\top (A_{11})^{-1} a_{12} > 0. \qquad (14)$$

Consider updating the $p$-th row/column of the precision matrix. Since the block $\Theta_{11}$ remains fixed, we only need to show the second condition from (14). Using the updates of dpgelnet (Algorithm 4):

$$\hat\theta_{22} - \hat\theta_{12}^\top(\Theta_{11})^{-1}\hat\theta_{12} = \frac{1 - \big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)^\top\hat\theta_{12}}{w_{22}} + \frac{\big(s_{12} + (1 + \tfrac{1-\alpha}{\alpha}\tilde q) * \gamma^*\big)^\top(\Theta_{11})^{-1}\Theta_{11}\,\hat\theta_{12}}{w_{22}} = \frac{1}{w_{22}} = \frac{1}{s_{22} + \lambda\alpha + \lambda(1-\alpha)\theta_{22}} > 0.$$

The problem of solving $p$-dimensional glasso (or dpglasso) problems can in some cases be reduced to solving several lower-dimensional problems, enabling massive speed-ups for the computations (see Mazumder and Hastie, 2012b; Witten et al., 2011). A similar result also holds for gelnet (or dpgelnet). Following the notation of Mazumder and Hastie (2012b), define the nodes $\mathcal{V} = \{1, \dots, p\}$ and the matrix $E$:

$$E_{ij} = \begin{cases} 1 & \hat\Theta_{ij}(\lambda, \alpha) \neq 0,\ i \neq j, \\ 0 & \text{otherwise}, \end{cases}$$

where $\hat\Theta(\lambda, \alpha)$ is the estimated precision matrix for the input covariance matrix $S$ and the two tuning parameters $\lambda$ and $\alpha$, see equation (1). This defines a symmetric graph $\mathcal{G} = (\mathcal{V}, E)$. Decompose this graph into its connected components, i.e., $\mathcal{G} = \bigcup_{l=1}^{\tilde L} \mathcal{G}_l$, where $\tilde L$ is the number of connected components and $\mathcal{G}_l = (\mathcal{V}_l, E_l)$ the $l$-th sub-graph. Furthermore, define $\hat E$ as

$$\hat E_{ij} = \begin{cases} 1 & |S_{ij}| > \lambda\alpha,\ i \neq j, \\ 0 & \text{otherwise}, \end{cases}$$

and the graph $\hat{\mathcal{G}} = (\mathcal{V}, \hat E)$. Decompose this graph into its connected components as well, i.e., $\hat{\mathcal{G}} = \bigcup_{l=1}^{\hat L} \hat{\mathcal{G}}_l$, where $\hat L$ is the number of connected components and $\hat{\mathcal{G}}_l = (\hat{\mathcal{V}}_l, \hat E_l)$.

Theorem 1 (taken from Atchadé et al., 2015). Let $\mathcal{G}_l = (\mathcal{V}_l, E_l)$, $l = 1, \dots, \tilde L$ and $\hat{\mathcal{G}}_l = (\hat{\mathcal{V}}_l, \hat E_l)$, $l = 1, \dots, \hat L$ denote the connected components, as defined above. Then $\tilde L = \hat L$ and there exists a permutation $\Pi$ on $\{1, \dots, \hat L\}$ such that $\hat{\mathcal{V}}_{\Pi(l)} = \mathcal{V}_l$ and $\hat E_{\Pi(l)} = E_l$ for all $l = 1, \dots, \hat L$.

The determination of the connected components (based on $\hat E$) is computationally cheap. Note that the parts of $\hat\Theta$ corresponding to the different connected components can then be solved independently, i.e. a suitable permutation of $\Theta$ and $W$ leads to block-diagonal form. The $\hat L$ blocks are exactly of the sizes of the connected components. This result is especially attractive if the maximum size of the connected components is small compared to $p$, since the additional effort to compute the connected components is negligible compared to the gains of reducing the problem into smaller sized problems. For fixed $\alpha \neq 0$ and $S$, the number of connected components is increasing in $\lambda$. If $\lambda\alpha \geq |S_{ij}|$ for all $i, j \in \mathcal{V}$ with $i \neq j$, then $\hat\Theta$ and $\hat W$ are diagonal matrices. We incorporated the check of such connected components prior to starting the actual calculations in our implementation in the GLassoElnetFast R-package. A minimal sketch of this screening step is given below.
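A minimal R sketch of the screening step, assuming the igraph package is used for the connected component computation (the package's internal implementation differs):

```r
library(igraph)

# Screening: threshold |S_ij| > lambda * alpha off the diagonal, then find
# connected components; each block can be solved as a separate, smaller problem.
connected.blocks <- function(S, lambda, alpha) {
  A <- (abs(S) > lambda * alpha) * 1
  diag(A) <- 0
  g <- graph_from_adjacency_matrix(A, mode = "undirected")
  split(seq_len(ncol(S)), components(g)$membership)   # list of variable blocks
}
```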
3.3 Inclusion of diagonal target matrices

We now turn towards the inclusion of a positive semi-definite (diagonal) target matrix $T$ into the Graphical Elastic Net problem. Some motivation for this is provided by van Wieringen and Peeters (2016) as well as Kuismin et al. (2017) in the case of Ridge type penalization. The target can be interpreted as prior knowledge or an educated guess. We aim to solve problem (1), which we recall here:

$$\hat\Theta(\lambda, \alpha, T) = \operatorname*{argmin}_{\Theta \succ 0} \Big\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\big(\alpha\|\Theta - T\|_1 + \tfrac{1-\alpha}{2}\|\Theta - T\|_2^2\big) \Big\}.$$

This optimization problem is difficult to solve efficiently in general. In the $L_1$-penalty case, van Wieringen (2019) proposed to iteratively apply his generalized Ridge estimator (with elementwise differing $\lambda$) to approximate the loss function for the $L_1$ case. However, this approach is computationally not attractive, even for moderately sized problems (see section 5 for computational times). We propose to simplify the problem by considering only positive semi-definite diagonal target matrices. While diagonal matrices are less flexible than arbitrary positive semi-definite target matrices, we note that in practice many fall into this category (e.g. all target matrices that were used by van Wieringen and Peeters (2016) and by Kuismin et al. (2017)).

When considering diagonal target matrices with non-negative entries, the normal equations for the diagonal entries change, while for the other entries they remain the same. For each diagonal entry three cases can occur, leading to the following normal equations:

Case 1: $\theta_{22} > t_{22}$, then $w_{22} = s_{22} + \lambda\alpha + \lambda(1-\alpha)(\theta_{22} - t_{22})$.
Case 2: $\theta_{22} = t_{22}$, then $w_{22} = s_{22} + \lambda\alpha u$, where $u \in [-1, 1]$.
Case 3: $\theta_{22} < t_{22}$, then $w_{22} = s_{22} - \lambda\alpha + \lambda(1-\alpha)(\theta_{22} - t_{22})$.

Note that in the traditional setting, when zero is the target matrix, case 1 always occurs. In the general case, however, one cannot automatically update diagonal elements based on case 1. We derive in Appendix A.1 the technical modifications of the gelnet algorithm. Overall, in contrast to the algorithm without target, the initialization and updating of the diagonal entries require care. We note that our chosen updates work for realistic targets, but occasionally may not converge for targets with very large diagonal entries. However, such targets are not desirable from a statistical perspective. A small sketch of the case distinction is given below.
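For illustration, a minimal R sketch directly transcribing the three-case diagonal normal equation above (the actual update rule used inside the algorithm is derived in Appendix A.1):

```r
# Diagonal normal equation with a diagonal target entry t22: returns w22 for a
# given theta22 (cases 1-3 above). For theta22 == t22 the subgradient u in
# [-1, 1] is a free parameter, so we return the interval bounds.
w22.update <- function(theta22, t22, s22, lambda, alpha) {
  if (theta22 > t22) {
    s22 + lambda * alpha + lambda * (1 - alpha) * (theta22 - t22)
  } else if (theta22 < t22) {
    s22 - lambda * alpha + lambda * (1 - alpha) * (theta22 - t22)
  } else {
    c(s22 - lambda * alpha, s22 + lambda * alpha)   # any value in between is valid
  }
}
```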
Our proposed algorithms gelnet and dpgelnet are generalizations of glasso and dpglasso. For the Lasso case $\alpha = 1$, gelnet performs the same updates as glasso and dpgelnet performs the same updates as dpglasso. For Elastic Net penalties, i.e. $\alpha \in (0, 1)$, the updates of gelnet and dpgelnet are slightly modified to incorporate the additional quadratic penalty term. Lemma 1 ensures that dpgelnet maintains positive definite precision matrices throughout the algorithm for arbitrary positive definite precision matrices as warm starts. This feature is appealing when a fit is already available, for example from a slightly different $\lambda$ value in cross-validation, or in change point detection problems from a neighbouring split point (see Kovács et al., 2020a,b). gelnet lacks this feature, and for certain warm starts it may not converge. However, as an advantage, gelnet has updates that allow the incorporation of diagonal target matrices, and hence it is more flexible than dpgelnet in its currently presented form. As long as single fits are required (i.e. without warm starts), we recommend gelnet. If repeated fits relying on warm starts are necessary in some application, one can still try to use gelnet with warm starts and, in case this faces convergence issues, resort to dpgelnet.

4 Simulations

In this section the statistical performance of the Graphical Elastic Net estimator is compared to its special cases glasso (Friedman et al., 2008) and rope (Kuismin et al., 2017). Note that van Wieringen and Peeters (2016) also proposed Ridge-type penalization and called it the Alternative Ridge Precision estimator (with Type I for the case of a zero target matrix and Type II for a nonzero positive definite target matrix). For simplicity, we will use the name rope in the following for such Ridge-type penalization approaches.
Simulation models.
The simulation setup is similar to that of Kuismin et al. (2017). We draw $n$ independent realizations from multivariate Gaussian distributions $N(0, \Sigma)$, where $\Sigma = [\sigma_{i,j}] \in \mathbb{R}^{p \times p}$ and $\Theta = [\theta_{i,j}] = \Sigma^{-1}$ are positive definite matrices coming from 6 different models.

• Model 1 (Compound symmetry model): $\sigma_{i,i} = 1$ and $\sigma_{i,j} = 0.\!\ldots$ for $i \neq j$.

Models 2-4 are taken from Liu and Wang (2017). In these models an adjacency matrix $A$ is generated from a graph, where each nonzero off-diagonal element is set to 0.3 and the diagonal elements to 0. Then the smallest eigenvalue $\Lambda_{\min}(A)$ is calculated and the corresponding precision matrix is constructed by $\Theta = D(A + (|\Lambda_{\min}(A)| + 0.\!\ldots)I)D$, where $D \in \mathbb{R}^{p \times p}$ is a diagonal matrix with $D_{i,i} = 1$ for $i = 1, \dots, p/2$ and $D_{i,i} = 3$ for $i = p/2 + 1, \dots, p$.

• Model 2 (Scale-free graph model): The graph begins with an initial small chain graph of 2 nodes. New nodes are added to the graph one at a time. Each new node is connected to one existing node with a probability proportional to the number of degrees that the existing node already has.
• Model 3 (Hub graph model): The $p$ nodes are evenly partitioned into $p/10$ disjoint groups, with each group containing 10 nodes. Within each group, one node is selected as the hub and edges between the hub and the other 9 nodes are added.
• Model 4 (Block graph model): Here $\tilde\Theta$ is directly produced by making a block diagonal matrix with block size $p/\ldots$, where the off-diagonal entries are set to 0.5 and the diagonal entries to 1. The matrix is then randomly permuted by rows/columns and the resulting covariance matrix is taken as $\Sigma = D^{-1}\tilde\Theta^{-1}D^{-1}$, where this time $D_{i,i} = 1$ for $i = 1, \dots, p/2$ and $D_{i,i} = 1.\!\ldots$ for $i = p/2 + 1, \dots, p$.

Models 5 & 6 are graphical models coming from Cai et al. (2016). First generate the matrix $\tilde\Theta = [\tilde\theta_{i,j}]$ and then multiply with the inverse of a diagonal matrix $D$ from both sides to get $\Sigma$. Each diagonal entry of $D$ is independently generated from a uniform distribution on the interval 1 to 5.

• Model 5 (Band graph model): $\tilde\theta_{i,i} = 1$, $\tilde\theta_{i,i+1} = \tilde\theta_{i+1,i} = 0.6$, $\tilde\theta_{i,i+2} = \tilde\theta_{i+2,i} = 0.3$, $\tilde\theta_{i,j} = 0$ for $|i - j| \geq 3$.
• Model 6 (Erdős-Rényi random graph model): Take $\tilde{\tilde\Theta} = [\tilde{\tilde\theta}_{i,j}]$, where $\tilde{\tilde\theta}_{i,j} = u_{i,j} \cdot \delta_{i,j}$, such that $\delta_{i,j}$ is a Bernoulli random variable with success probability 0.05 and $u_{i,j}$ is a uniform random variable on the interval 0.4 to 0.8. Then take $\tilde\Theta = \tilde{\tilde\Theta} + (|\Lambda_{\min}(\tilde{\tilde\Theta})| + 0.\!\ldots)I$.

For all models we transform the covariance matrix to a correlation matrix before simulating the data. With the exception of Model 1, all models have a sparse structure in $\Theta$. In order to get performance measures for the different methods, 100 independent simulations are performed for each model, and the averages of several loss functions are calculated. Using five-fold cross-validation, the optimal tuning parameter $\lambda$ for each method is determined for each simulation run separately. As an illustration, a sketch of one of the model constructions is given below.
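A minimal R sketch of the Model 5 (band graph) construction, including the diagonal scaling matrix $D$ and the final rescaling to a correlation matrix described above:

```r
# Band precision matrix of Model 5: 1 on the diagonal, 0.6 on the first
# off-diagonal, 0.3 on the second, 0 elsewhere; then Sigma = D^-1 Theta~^-1 D^-1
# with D entries uniform on [1, 5], rescaled to a correlation matrix.
model5.sigma <- function(p) {
  ThetaT <- diag(p)
  ThetaT[abs(row(ThetaT) - col(ThetaT)) == 1] <- 0.6
  ThetaT[abs(row(ThetaT) - col(ThetaT)) == 2] <- 0.3
  D <- diag(runif(p, 1, 5))
  Sigma <- solve(D) %*% solve(ThetaT) %*% solve(D)
  cov2cor(Sigma)   # transform to a correlation matrix
}
```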
Performance measures. The different measures used in the simulations can be split into two groups.

1) Loss functions used by Kuismin et al. (2017):
• Kullback-Leibler loss: $\mathrm{KL} = \operatorname{tr}(\Sigma\hat\Theta) - \log(\det(\Sigma\hat\Theta)) - p$
• L2 loss: $\mathrm{L2} = \|\Theta - \hat\Theta\|_F$
• Spectral norm loss: $\mathrm{SP} = d_1$, where $d_1$ is the largest eigenvalue of the matrix $(\Theta - \hat\Theta)^\top(\Theta - \hat\Theta)$

2) Graph recovery measures: Let $\hat\Theta$ be the estimated solution and $\Theta$ the underlying truth. For a given threshold $\epsilon$, define the adjacency matrices $\hat A$ and $A$:

$$\hat A_{i,j} = \begin{cases} 1 & |\hat\Theta_{i,j}| \geq \epsilon, \\ 0 & \text{otherwise}, \end{cases} \qquad A_{i,j} = \begin{cases} 1 & |\Theta_{i,j}| \geq \epsilon, \\ 0 & \text{otherwise}. \end{cases}$$

For $i < j$, the edge $ij$ in $\hat A$ is present if $\hat A_{i,j} = 1$, and similarly for $A$. Now define the following quantities:
• True Positives: TP = number of edges $ij$ which are both in $\hat A$ and in $A$
• True Negatives: TN = number of edges $ij$ which are neither in $\hat A$ nor in $A$
• False Positives: FP = number of edges $ij$ in $\hat A$ but not in $A$
• False Negatives: FN = number of edges $ij$ in $A$ but not in $\hat A$

Then define the graph recovery measures:
• F1 score: $\mathrm{F1} = \dfrac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$
• Matthews correlation coefficient: $\mathrm{MCC} = \dfrac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$

The F1 score takes values in $[0, 1]$ and the MCC in $[-1, 1]$, with higher values being better. The threshold $\epsilon$ in the simulations is fixed as $\epsilon = 10^{-\ldots}$. When interpreting the results shown in Figure 7 and Figure 8 later on, one should keep in mind that the rope algorithm produces non-sparse solutions, and therefore almost no true negatives and false negatives are produced for this $\epsilon$. Moreover, Model 1 is not sparse, and thus analyzing the graph recovery measures in this model is not meaningful. A minimal sketch of the graph recovery measures is given below.
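A minimal R sketch of the graph recovery measures, given the true and estimated precision matrices (the default eps = 1e-4 here is an assumption, as the exact threshold exponent did not survive extraction):

```r
# F1 and MCC for edge recovery: edges are off-diagonal entries (i < j) whose
# absolute value exceeds the threshold eps.
recovery <- function(Theta, Theta.hat, eps = 1e-4) {
  up <- upper.tri(Theta)
  a  <- abs(Theta)[up] >= eps       # true edges
  ah <- abs(Theta.hat)[up] >= eps   # estimated edges
  tp <- sum(ah & a);  tn <- sum(!ah & !a)
  fp <- sum(ah & !a); fn <- sum(!ah & a)
  f1  <- 2 * tp / (2 * tp + fp + fn)
  mcc <- (tp * tn - fp * fn) /
         sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(F1 = f1, MCC = mcc)
}
```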
Figure 4: Summary of KL loss for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean losses. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 5: Summary of L2 loss for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean losses. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 6: Summary of SP loss for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean losses. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 7: Summary of F1 score for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean scores. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).

Figure 8: Summary of MCC score for the different models and different methods based on 100 replications (dimension 100). The columns along with the small black dots indicate mean scores. The bars on top of each column show the standard deviations (mean ± SD). Plot layout is taken from Kuismin et al. (2017).
Details on the methods used.
Note that due to the $L_2$-penalty term, the scaling of the variables matters. Thus, as input $S$, we recommend to use the sample correlation rather than the covariance matrix in practice. For the Graphical Elastic Net we fixed $\alpha = 0.5$ in the Elastic Net penalty of equation (1) whenever using it. The simulations were done with and without targets, as well as with and without penalizing the diagonal. Notice that in our case the targets are diagonal matrices, so not penalizing the diagonal automatically results in no target. The following abbreviations are used in the figures; the target matrices mentioned here are described in section 2.1.
• F: Setting the penalize.diagonal argument to FALSE, i.e. no penalization of the diagonal.
• True: Using the True Diagonal target matrix.
• I: Using the Identity target matrix.
• vI: Using the $v$-Identity target matrix.
• EV: Using the Eigenvalue target matrix.
• MSC: Using the Maximal Single Correlation target matrix.
• Reg: Using the Nodewise Regression target matrix.
Simulation results.
The results are displayed in Figures 4, 5, 6, 7 and 8. As the dpgelnet algorithm leads to results indistinguishable from those of the gelnet algorithm, we left it out of the figures and also of the discussion below. Various performance measures may yield different rankings of the methods, such that it is hard to draw an overall conclusion on which method is superior, as well as a general recommendation on which method or target matrices to use. Nonetheless, we highlight a few observations based on the results shown in the figures:
• Not penalizing the diagonal of $\Theta$ leads to smaller losses than penalizing with zero as a target.
• While the message from Kuismin et al. (2017), that rope with target performs better than glasso with no target, holds true in some of the models, there are some models opposing this statement in general.
• In general, it is an advantage to include a target in the algorithms, as each method performs better if a suitable target is chosen.
• The question of how to determine the optimal target matrix is still open. Our simulations show that the True Diagonal target matrix performs best in terms of L2 loss and SP loss, and thus approaching the diagonal of the underlying precision matrix is desirable. However, the True Diagonal target is not necessarily the winner in terms of the Kullback-Leibler loss.
• There is a systematic bias in the estimated diagonal entries of the covariance matrices if the diagonal entries are penalized. For example, for the standard glasso algorithm with zero as target matrix, the diagonal of $\hat W$ is set as $\hat w_{i,i} = s_{i,i} + \lambda$ (if the diagonal is penalized). Similarly, bias of the diagonal entries also occurs for Elastic Net penalties and target matrices. In all these cases, one could try re-scaling the matrix such that $\hat w_{i,i} = s_{i,i}$. All statements above would still hold, but are often less clearly visible.

Including a reasonable target matrix seems to improve the estimation and is thus recommended. The intuition that the true diagonal (or something close to it) is the best diagonal target does not necessarily hold. Furthermore, there is no overall winner. Depending on the underlying precision matrix and the considered performance measure, different target types might be better suited. Hence, we recommend exploring different target types as a tool to gain better insight into the data. The benefits of Elastic Net penalties are not clearly visible in the considered simulations. Bigger differences are expected for highly correlated variables, analogous to the findings in regression by Zou and Hastie (2005), where the Elastic Net could potentially lead to more stable estimates.
5 Computational timings

Our implementations in the GLassoElnetFast R-package build upon the glasso (Friedman et al., 2019) and dpglasso (Mazumder and Hastie, 2012a) R-packages by suitably modifying them. Additionally, we also utilized a faster implementation from the glassoFast R-package (Sustik et al., 2018; Sustik and Calderhead, 2012) that avoids unnecessary copying of subsets of the working covariance matrix when setting up row and column updates, which causes some inefficiency in the glasso package. Moreover, whenever possible, we even improve on the original versions. For example, to achieve further speed-ups, we incorporate block diagonal screening (Mazumder and Hastie, 2012b; Witten et al., 2011), which was missing in the dpglasso and glassoFast packages. Compared to the dpglasso package, our dpgelnet implementation relies even more on Fortran, and it avoids unnecessary copying of working covariance matrices (in the style of the glassoFast package), both of which are beneficial for its speed.
Figure 9 compares the computational speed for Graphical Lasso estimation problems (i.e., $\alpha = 1$) using the implementations from the GLassoElnetFast, glasso and glassoFast R-packages. Our gelnet implementation in the GLassoElnetFast package is implemented similarly to the glassoFast package and is hence similarly fast. However, we additionally included the block-diagonal screening rule of Mazumder and Hastie (2012b) and Witten et al. (2011). As shown on the right plot of Figure 9, in case the problem can be decomposed into connected components (see also Theorem 1), our implementation can be considerably faster. There might be problems, though, where everything (or at least a predominant majority of the nodes) belongs to the same connected component, see the left of Figure 9. In such cases our implementation tends to be slightly slower compared to the glassoFast package, because the check for connected components has to be done, and additionally slightly more computations are carried out, as our implementation can handle general Elastic Net penalties rather than only fine-tuned computations for the Graphical Lasso case of $\alpha = 1$.

Figure 9: Computed accuracy (maximal absolute error of the off-diagonal entries) for the working covariance matrix $W$ (on a logarithmic scale) vs. average computational times of 10 runs each. On the left, the input correlation matrix is the one from the ER dataset ($p = 692$, available in the QUIC R-package of Hsieh et al., 2014). On the right, analogous results are shown for a $p = 3460$-dimensional problem composed of five blocks of size 692 each, i.e., the input here is a block diagonal matrix where the correlation matrix of the ER dataset was used for each of the five blocks. In all cases a standard Graphical Lasso estimator (i.e., $\alpha = 1$) with tuning parameter $\lambda = 0.\!\ldots$ was computed with the GLassoElnetFast, glasso and glassoFast R-packages.

We introduced new methodology for sparse precision matrix estimation, and in the following we would like to demonstrate efficient implementations for these new proposals as well. To measure the computational times of the different methods, the
FHT dataset, available e.g. in the R-package gcdnet (Yang and Zou, 2017), is used. It contains $n = 50$ observations and $p = 100$ variables, for which we calculate the empirical correlation matrix $S \in \mathbb{R}^{p \times p}$. For a fixed penalty parameter $\lambda$, each method computes the solution 20 times without warm start and 20 times with a warm start coming from the solution with a similar parameter $\lambda$. The average computational times are displayed in Figure 10. The abbreviations for the algorithms are similar to those of section 4, with the small additions that the number in brackets denotes the value of $\alpha$, and IRfGlassoT is the Iterative Ridge for Glasso with a target matrix, as proposed and implemented by van Wieringen (2019). Table 1 summarizes the abbreviations of the methods and the availability of the implementations. The target matrix $T$ used in these simulations is always the Identity matrix.

The plots on the left of Figure 10 are based on values $\lambda$ such that only one connected component is present. Note that the required $\lambda$ differs for various $\alpha$ values in the Elastic Net penalty (see Theorem 1). gelnet and dpgelnet are competitive with glasso and dpglasso both in the cold start (top) and in the warm start case (bottom) for $\alpha = 0.5$ and $\alpha = 1$. For small $\alpha$ they slow down, and the closed form rope for $\alpha = 0$ is much faster. Note here, however, that the closed form for rope does not allow entry-wise differing $\lambda$ values, which would be possible using our iterative gelnet and dpgelnet algorithms with $\alpha = 0$. To further illustrate the efficiency of the implementations, the gains of using connected components are shown in the right panel of Figure 10. The parameter $\lambda$ is chosen such that the largest connected component is of size 50, whenever this is possible, i.e. for all algorithms except rope, gelnet ($\alpha = 0$) and dpgelnet ($\alpha = 0$). The timings show that instead of using dpglasso, where only the inner loop of coordinate descent is implemented in Fortran, it is recommended to use dpgelnet with $\alpha = 1$, where both the outer and inner loop are implemented in Fortran, the more efficient general implementation in the style of the glassoFast package is included, and the check for connected components is performed prior to the computations in order to gain further computational efficiency. Besides the speedup resulting from solving the problem within several smaller blocks due to the results on connected components, this setup is also much sparser (due to a larger value of the tuning parameter $\lambda$) for Elastic Net based estimators, which leads to considerable speedups compared to the left panel. In particular, Elastic Net based sparse estimators (with $\alpha = 0.5$ and $\alpha = 1$) are even faster in this case than the closed form solution for rope.

The only competitor that is able to incorporate a target matrix into the Graphical Lasso is the Iterative Ridge for Glasso of van Wieringen (2019). The available implementation of the latter approach is limited to $\alpha = 1$, and what is even more critical is that its speed is orders of magnitude behind all of our presented algorithms. In particular, the Iterative Ridge for Glasso approach is much slower than our gelnet algorithm with target matrices, with the latter also enabling targets in the full range of $\alpha \in [0, 1]$. Moreover, the glassoFast implementation of the glasso algorithm is considerably faster than the one from the glasso package. Our gelnet implementation (for $\alpha = 1$) is approximately as fast as the glassoFast one. A minimal sketch of such a timing comparison is given below.
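A minimal R sketch of such a timing comparison, assuming the microbenchmark package and the alpha argument name used in the earlier snippets (X denotes the 50 x 100 FHT data matrix):

```r
library(microbenchmark)
library(glasso)
library(glassoFast)

S <- cor(X)   # empirical correlation matrix of the FHT data
lam <- 0.3    # a fixed penalty parameter (illustrative value)

microbenchmark(
  glasso     = glasso::glasso(S, rho = lam),
  glassoFast = glassoFast::glassoFast(S, rho = lam),
  gelnet     = gelnet(S, lambda = lam, alpha = 1),
  times = 20
)
```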
Table 1: Abbreviations of the implementations used for Figure 10.

abbreviation | problem, algorithm | implementation (function, package)
Rope | α = 0, closed form | rope, GLassoElnetFast
Glasso | α = 1, glasso | glasso, glasso
GlassoFast | α = 1, glasso | glassoFast, glassoFast
Gelnet(1) | α = 1, gelnet | gelnet, GLassoElnetFast
Gelnet(0.5) | α = 0.5, gelnet | gelnet, GLassoElnetFast
Gelnet(0) | α = 0, gelnet | gelnet, GLassoElnetFast
DPGlasso | α = 1, dpglasso | dpglasso, dpglasso
DPGelnet(1) | α = 1, dpgelnet | dpgelnet, GLassoElnetFast
DPGelnet(0.5) | α = 0.5, dpgelnet | dpgelnet, GLassoElnetFast
DPGelnet(0) | α = 0, dpgelnet | dpgelnet, GLassoElnetFast
GelnetT(1) | α = 1 with target, gelnet | gelnet, GLassoElnetFast
GelnetT(0.5) | α = 0.5 with target, gelnet | gelnet, GLassoElnetFast
IRfGlassoT | α = 1 with target, iterative ridge | see van Wieringen (2019)
Figure 10:
Average computation times for the 100-dimensional correlation matrix from the FHT dataset with a fixed penalty parameter, with cold starts (on the top, in blue) or warm starts coming from a similar penalty parameter (on the bottom, in red). The plots on the right use λ such that the estimate Θ̂ splits into connected components of maximum size 50, whereas for the plots on the left λ is chosen such that Θ̂ stays fully connected. For the explanations of the names, see Table 1.

By combining the best parts of various available implementations, our software often improves previous DP-Graphical Lasso implementations considerably, and it is as fast as, or in some scenarios even faster than, existing efficient Graphical Lasso implementations. Moreover, our software generalizes beyond the case α = 1 to incorporate Elastic Net penalties, and the Graphical Elastic Net variant also allows the inclusion of a diagonal target matrix. Computation with Elastic Net penalties typically takes slightly longer than with the L1-penalty only. Our implementation for target matrices is also efficient, and it is orders of magnitude faster than the competing but ad-hoc Iterative Ridge algorithm of van Wieringen (2019).

Conclusions

We consider Elastic Net type penalization for precision matrix estimation based on the Gaussian log-likelihood. To solve this task, we propose two novel algorithms, gelnet and dpgelnet. They build substantially on earlier algorithmic work for the glasso for precision matrix estimation (Friedman et al., 2008; Mazumder and Hastie, 2012c), the Elastic Net for regression (Zou and Hastie, 2005) and the inclusion of target matrices to encode prior information towards which the estimator is shrunken (van Wieringen and Peeters, 2016; Kuismin et al., 2017). Advantages of our new work include methodological extensions enabling higher flexibility for data analysis, the important option of including diagonal target matrices into the estimation methods, user-friendly software, an efficient implementation and the possibility to choose different methodological versions within the same software package.

dpgelnet optimizes over the desired precision matrix in each block update, whereas gelnet works with the covariance matrix, the inverse of the precision matrix. There is no overall advantage of using one algorithm over the other, and computational performance gains depend on the true underlying signal. All our algorithms are implemented efficiently in the R-package GLassoElnetFast using a combination of R and Fortran code. They are competitive in terms of computational time with existing methods. There seems to be no overall winner in terms of statistical performance. The inclusion of an L2-norm into the Elastic Net penalty encourages additional stability, especially in the presence of highly correlated variables. Moreover, our simulations show that target matrices and not penalizing the diagonal can be powerful tools to improve the estimation of precision matrices. Our new software in the R-package GLassoElnetFast and its algorithmic methodology provide a unifying tool to use various modifications of the popular glasso algorithm.
Acknowledgement
Solt Kovács and Peter Bühlmann have received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 786461 CausalStats - ERC-2017-ADG).
References
Allen, G. I. (2010). Transposable regularized covariance models with applications to high-dimensional data. PhD thesis, Stanford University, Department of Statistics.

Atchadé, Y. F., Mazumder, R., and Chen, J. (2015). Scalable computation of regularized precision matrices via stochastic optimization. arXiv:1509.00426.

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516.

Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227.

Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.

Cai, T. T., Liu, W., and Zhou, H. H. (2016). Estimating sparse precision matrix: optimal rates of convergence and adaptive estimation. The Annals of Statistics, 44(2):455–488.

Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.

Danaher, P., Wang, P., and Witten, D. M. (2012). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B, 76(2):373–397.

Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive lasso and scad penalties. The Annals of Applied Statistics, 3(2):521–541.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.

Friedman, J., Hastie, T., and Tibshirani, R. (2019). Graphical Lasso: Estimation of Gaussian Graphical Models. glasso R package version 1.11.

Guo, J., Levina, E., Michailidis, G., and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika, 98(1):1–15.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15:2911–2947.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., Ravikumar, P. K., and Poldrack, R. (2013). BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems 26, pages 3165–3173.

Jankova, J. and van de Geer, S. (2015). Confidence intervals for high-dimensional inverse covariance estimation. Electronic Journal of Statistics, 9:1205–1229.

Kovács, S., Li, H., Bühlmann, P., and Munk, A. (2020). Seeded binary segmentation: a general methodology for fast and optimal change point detection. arXiv:2002.06633.

Kovács, S., Li, H., Haubner, L., Munk, A., and Bühlmann, P. (2020). Optimistic search strategy: change point detection for large-scale data via adaptive logarithmic queries. arXiv:2010.10194.

Kuismin, M. O., Kemppainen, J. T., and Sillanpää, M. J. (2017). Precision matrix estimation with ROPE. Journal of Computational and Graphical Statistics, 26(3):682–694.

Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics, 37(6):4254–4278.

Lauritzen, S. L. (1996). Graphical Models. Oxford Statistical Science Series. Oxford University Press.

Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411.

Liu, H. and Wang, L. (2017). TIGER: A tuning-insensitive approach for optimally estimating Gaussian graphical models. Electronic Journal of Statistics, 11(1):241–294.

Londschien, M., Kovács, S., and Bühlmann, P. (2020). Change point detection for graphical models in the presence of missing values. Journal of Computational and Graphical Statistics, to appear.

Marlin, B. M. and Murphy, K. P. (2009). Sparse Gaussian graphical models with unknown block structure. In Proceedings of the 26th International Conference on Machine Learning, pages 705–713.

Mazumder, R. and Hastie, T. (2012a). dpglasso: Primal Graphical Lasso. dpglasso R package version 1.0.

Mazumder, R. and Hastie, T. (2012b). Exact covariance thresholding into connected components for large-scale graphical Lasso. Journal of Machine Learning Research, 13(1):781–794.

Mazumder, R. and Hastie, T. (2012c). The graphical Lasso: new insights and alternatives. Electronic Journal of Statistics, 6:2125–2149.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.

Peeters, C., Bilgrau, A., and van Wieringen, W. (2020). rags2ridges: Ridge Estimation of Precision Matrices from High-Dimensional Data. rags2ridges R package version 2.2.3.

Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746.

Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing L1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.

Rothman, A. J. (2012). Positive definite estimators of large covariance matrices. Biometrika, 99(3):733–740.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.

Scheinberg, K., Ma, S., and Goldfarb, D. (2010). Sparse inverse covariance selection via alternating linearization methods. In Advances in Neural Information Processing Systems 23, pages 2101–2109.

Shan, L. and Kim, I. (2018). Joint estimation of multiple Gaussian graphical models across unbalanced classes. Computational Statistics & Data Analysis, 121:89–103.

Städler, N. and Bühlmann, P. (2012). Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 22(1):219–235.

Sustik, M. A. and Calderhead, B. (2012). GLASSOFAST: An efficient GLASSO implementation. UTCS Technical Report TR-12-29:1–3.

Sustik, M. A., Calderhead, B., and Clavel, J. (2018). glassoFast: Fast Graphical LASSO. glassoFast R package version 1.0.

Tan, K. M., Witten, D., and Shojaie, A. (2015). The cluster graphical lasso for improved estimation of Gaussian graphical models. Computational Statistics & Data Analysis, 85:23–36.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494.

van Wieringen, W. N. (2019). The generalized ridge estimator of the inverse covariance matrix. Journal of Computational and Graphical Statistics, 28(4):932–942.

van Wieringen, W. N. and Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103:284–303.

Wille, A., Zimmermann, P., Vranová, E., et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11).

Witten, D. M., Friedman, J. H., and Simon, N. (2011). New insights and faster computations for the graphical Lasso. Journal of Computational and Graphical Statistics, 20(4):892–900.

Yang, Y. and Zou, H. (2017). gcdnet: LASSO and Elastic Net (Adaptive) Penalized Least Squares, Logistic Regression, HHSVM, Squared Hinge SVM and Expectile Regression using a Fast GCD Algorithm. gcdnet R package version 1.0.5.

Yuan, M. (2010). High-dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261–2286.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.

Zhou, S., Rütimann, P., Xu, M., and Bühlmann, P. (2011). High-dimensional covariance estimation based on Gaussian graphical models. Journal of Machine Learning Research, 12:2975–3026.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320.
Appendix
A.1 Details on target matrices
We discuss the inclusion of a positive semi-definite (diagonal) target matrix T into the Graphical Elastic Net problem. We aim to solve problem (1), which we recall here:

$$\hat{\Theta}(\lambda, \alpha, T) = \operatorname*{argmin}_{\Theta \succ 0} \left\{ -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\Big(\alpha\,\|\Theta - T\|_1 + \frac{1-\alpha}{2}\,\|\Theta - T\|_2^2\Big) \right\}. \qquad (1)$$

This optimization problem is difficult to solve efficiently in general. We propose to simplify the problem by considering only positive semi-definite diagonal target matrices. When considering diagonal target matrices with non-negative entries, the normal equations for the diagonal entries change, while for the other entries the normal equations remain the same. For each diagonal entry (j, j) three cases can occur, leading to the following normal equations:

Case 1: θ_jj > t_jj, then w_jj = s_jj + λα + λ(1 − α)(θ_jj − t_jj).
Case 2: θ_jj = t_jj, then w_jj = s_jj + λα u, where u ∈ [−1, 1].
Case 3: θ_jj < t_jj, then w_jj = s_jj − λα + λ(1 − α)(θ_jj − t_jj).

Note that in the traditional case, when zero is the target, case 1 always occurs. In the general case, however, one cannot automatically update diagonal elements based on case 1. The difficulty is that one does not know the precision matrix Θ in advance and hence whether updates corresponding to case 1, 2 or 3 would be suitable for the diagonal entries. This decision has to be made "on the fly". In the following we derive the necessary technical modifications of the gelnet algorithm. Overall, in contrast to the algorithm without a target, the initialization and the updates for the diagonal entries differ.
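To make the origin of these three cases explicit, one can write down the stationarity condition of problem (1) at a diagonal entry. The following display is our own sketch of this step; it only uses that the gradient of −log det Θ is −Θ⁻¹ = −W and the subdifferential of the absolute value:

```latex
% Stationarity (KKT) condition of (1) at the diagonal entry (j,j):
%   0 \in -w_{jj} + s_{jj} + \lambda\alpha\,\partial\lvert\theta_{jj}-t_{jj}\rvert
%         + \lambda(1-\alpha)(\theta_{jj}-t_{jj})
\[
  w_{jj} = s_{jj} + \lambda\alpha\, u + \lambda(1-\alpha)\,(\theta_{jj}-t_{jj}),
  \qquad u \in \partial\lvert\theta_{jj}-t_{jj}\rvert,
\]
% so that u = 1 for \theta_{jj} > t_{jj} (case 1), u \in [-1,1] for
% \theta_{jj} = t_{jj} (case 2) and u = -1 for \theta_{jj} < t_{jj} (case 3).
```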
Special Case. In the special case where all off-diagonal elements of S are smaller than or equal to λα, the solutions Θ and W are, by the results on connected components (see Theorem 1), diagonal matrices, and thus θ_jj = 1/w_jj for each diagonal entry. If α = 1, then w_jj = s_jj + λ and thus θ_jj = 1/(s_jj + λ). We now consider α ≠ 1.

Case 1: θ_jj > t_jj ≥ 0. Then 1/θ_jj = w_jj = s_jj + λα + λ(1 − α)(θ_jj − t_jj) > s_jj + λα, and therefore t_jj < θ_jj < 1/(s_jj + λα). In other words, case 1 is fulfilled if t_jj ∈ [0, 1/(s_jj + λα)).

Case 2: θ_jj = t_jj ≥ 0. Then 1/t_jj = 1/θ_jj = w_jj = s_jj + λα u, where u ∈ [−1, 1]. Suppose first that λα < s_jj. Then t_jj = 1/(s_jj + λα u), which is equivalent to t_jj ∈ [1/(s_jj + λα), 1/(s_jj − λα)]. On the other hand, if λα ≥ s_jj, then t_jj ∈ [1/(s_jj + λα), ∞) follows, since the target cannot be negative.

Case 3: θ_jj < t_jj. Then 1/θ_jj = w_jj = s_jj − λα + λ(1 − α)(θ_jj − t_jj) < s_jj − λα. Therefore, if λα ≥ s_jj, case 3 can never be fulfilled. Otherwise, θ_jj > 1/(s_jj − λα) and hence t_jj > θ_jj > 1/(s_jj − λα). In other words, case 3 is fulfilled if λα < s_jj and t_jj ∈ (1/(s_jj − λα), ∞).

Hence, for this special case, one can get the exact value of θ_jj by first determining from t_jj which case occurs. If θ_jj ≠ t_jj, a quadratic equation has to be solved. The values for w_jj follow from w_jj = 1/θ_jj.

Case 1:
$$\theta_{jj} = \frac{-\big(s_{jj} + \lambda\alpha - \lambda(1-\alpha)t_{jj}\big) + \sqrt{\big(s_{jj} + \lambda\alpha - \lambda(1-\alpha)t_{jj}\big)^2 + 4\lambda(1-\alpha)}}{2\lambda(1-\alpha)} \qquad (15)$$
Case 2: θ_jj = t_jj.
Case 3:
$$\theta_{jj} = \frac{-\big(s_{jj} - \lambda\alpha - \lambda(1-\alpha)t_{jj}\big) + \sqrt{\big(s_{jj} - \lambda\alpha - \lambda(1-\alpha)t_{jj}\big)^2 + 4\lambda(1-\alpha)}}{2\lambda(1-\alpha)}$$
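The case test and the closed-form solutions above fit in a few lines of R. This is a minimal sketch for a single diagonal entry, assuming λ > 0 and α < 1 (for α = 1 the solution is simply θ_jj = 1/(s_jj + λ)); all names are ours, chosen for illustration:

```r
# Closed-form diagonal solution in the special case where all off-diagonal
# entries of S are at most lambda * alpha; implements the case test and (15).
theta_diag_special <- function(s, t, lambda, alpha) {
  stopifnot(lambda > 0, alpha < 1, t >= 0)
  la   <- lambda * alpha
  lam2 <- lambda * (1 - alpha)
  # positive root of lam2 * theta^2 + b * theta - 1 = 0
  solve_quad <- function(b) (-b + sqrt(b^2 + 4 * lam2)) / (2 * lam2)
  if (t < 1 / (s + la)) {                    # case 1: theta_jj > t_jj
    solve_quad(s + la - lam2 * t)
  } else if (la < s && t > 1 / (s - la)) {   # case 3: theta_jj < t_jj
    solve_quad(s - la - lam2 * t)
  } else {                                   # case 2: theta_jj = t_jj
    t
  }
}
# e.g. theta_diag_special(s = 1, t = 0, lambda = 0.2, alpha = 0.5) recovers
# the zero-target (case 1) solution; w_jj is then 1 / theta_jj.
```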
General Case. Note that in the general case, where the off-diagonal entries of S are no longer all smaller than or equal to λα, the special case still provides some intuition: for small values in the target, case 1 applies, and the larger the entries of the target get, the more likely case 3 is the suitable one. As we assume that the target matrix is diagonal, and hence has non-zero elements only on the diagonal, the core of the gelnet algorithm stays the same. The changes to be made are the following:
• Initialize such that both W and Θ are positive semi-definite and preferably the suitable diagonal case is chosen.
• After solving the quadratic programs, update such that the diagonal entries can switch cases if needed.
Updates. The updates of ŵ_22 and θ̂_22 (Steps 2d and 2f in Algorithm 3) differ from the gelnet algorithm without a target. To ensure that ŵ_22 and θ̂_22 fall into the same case, a simultaneous update of both is needed. In practice, the following updates work for realistic targets, but may not converge for targets with very large diagonal entries. However, such targets with overly large diagonal entries are typically not desirable from a statistical perspective anyway.

Using the relation from Step 2d in Algorithm 3, w_22 can be expressed as w_22 = 1/θ_22 + w_12^T β. Substituting this expression into the diagonal normal equation leads to

$$\lambda\alpha\,\mathrm{sgn}(\theta_{22} - t_{22}) = \frac{1}{\theta_{22}} + w_{12}^T\beta - s_{22} - \lambda(1-\alpha)(\theta_{22} - t_{22}). \qquad (16)$$

Define the function F, which represents λα sgn(θ_22 − t_22), as

$$F(\theta_{22}) := \frac{1}{\theta_{22}} + w_{12}^T\beta - s_{22} - \lambda(1-\alpha)(\theta_{22} - t_{22}).$$

Note that the relevant range of values for F is [−λα, λα]. Define for θ_22 = t_22 the function value F_t as

$$F_t := F(t_{22}) = \frac{1}{t_{22}} + w_{12}^T\beta - s_{22}.$$

Since F is strictly decreasing in θ_22 > 0, we have that F(θ_22) < F_t for θ_22 > t_22 and F(θ_22) > F_t for θ_22 < t_22. By the representation above, we need F(θ_22) = λα for θ_22 > t_22 and F(θ_22) = −λα for θ_22 < t_22. Considering F_t ∈ [−λα, λα], we show by contradiction that then θ_22 = t_22. Assume θ_22 > t_22; then λα = F(θ_22) < F_t ≤ λα, a contradiction. On the other hand, if θ_22 < t_22, then −λα = F(θ_22) > F_t ≥ −λα, again a contradiction. By similar arguments one can show that if F_t > λα, we have θ_22 > t_22, and lastly, if F_t < −λα, we have θ_22 < t_22. Therefore, the value of F_t determines the case to be considered and can be seen as a "test". We need to solve equation (16) for θ_22 depending on the chosen case.

Case 1: Solve the following for θ_22:
$$0 = \lambda(1-\alpha)\theta_{22}^2 + \big(s_{22} + \lambda\alpha - \lambda(1-\alpha)t_{22} - w_{12}^T\beta\big)\theta_{22} - 1.$$
Then update w_22 = s_22 + λα + λ(1 − α)(θ_22 − t_22).
Case 2: θ_22 = t_22 and w_22 = s_22 + F_t.
Case 3: Solve the following for θ_22:
$$0 = \lambda(1-\alpha)\theta_{22}^2 + \big(s_{22} - \lambda\alpha - \lambda(1-\alpha)t_{22} - w_{12}^T\beta\big)\theta_{22} - 1.$$
Then update w_22 = s_22 − λα + λ(1 − α)(θ_22 − t_22).
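A compact R sketch of this simultaneous update (our own helper, not the package code; again α < 1 is assumed) could look as follows. Note that for t_22 = 0, F_t evaluates to Inf in R, so the test correctly selects case 1:

```r
# Simultaneous update of theta_22 and w_22 given wb = w_12' beta,
# following the F_t test and the case-wise quadratic equations above.
update_diag <- function(s, t, wb, lambda, alpha) {
  stopifnot(lambda > 0, alpha < 1, t >= 0)
  la   <- lambda * alpha
  lam2 <- lambda * (1 - alpha)
  Ft   <- 1 / t + wb - s                      # "test" value F_t = F(t_22)
  solve_quad <- function(b) (-b + sqrt(b^2 + 4 * lam2)) / (2 * lam2)
  if (Ft > la) {                              # case 1: theta_22 > t_22
    theta <- solve_quad(s + la - lam2 * t - wb)
    w     <- s + la + lam2 * (theta - t)
  } else if (Ft < -la) {                      # case 3: theta_22 < t_22
    theta <- solve_quad(s - la - lam2 * t - wb)
    w     <- s - la + lam2 * (theta - t)
  } else {                                    # case 2: theta_22 = t_22
    theta <- t
    w     <- s + Ft
  }
  list(theta22 = theta, w22 = w)
}
```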
Initialization. Assume α > 0. As in gelnet, the diagonal entries of Θ are produced first, and then W is taken as W = S + λα Γ + λ(1 − α)(Θ − T). This time Γ is the diagonal matrix with γ_ii = 1 if t_ii ∈ [0, 1/(s_ii + λα)) and γ_ii = 0 otherwise. If t_ii ∈ [0, 1/(s_ii + λα)), then define θ_ii as in (15); otherwise set θ_ii = t_ii.
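The initialization can then be sketched as follows, with T_diag holding the diagonal of the target and 0 < α < 1 assumed so that (15) is well defined (names again ours):

```r
# Initialization of Theta and W for Algorithm 5 with a diagonal target.
init_gelnet_target <- function(S, T_diag, lambda, alpha) {
  stopifnot(lambda > 0, alpha > 0, alpha < 1, all(T_diag >= 0))
  p    <- ncol(S)
  s    <- diag(S)
  la   <- lambda * alpha
  lam2 <- lambda * (1 - alpha)
  gamma <- as.numeric(T_diag < 1 / (s + la))   # 1 iff case 1 applies entrywise
  b      <- s + la - lam2 * T_diag             # case-1 quadratic, cf. (15)
  theta0 <- ifelse(gamma == 1,
                   (-b + sqrt(b^2 + 4 * lam2)) / (2 * lam2),
                   T_diag)
  Theta <- diag(theta0, p)
  W     <- S + la * diag(gamma, p) + lam2 * (Theta - diag(T_diag, p))
  list(Theta = Theta, W = W)
}
```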
The Final Algorithm. In the previous paragraphs we discussed the necessary technical modifications of the updates and the initialization of the gelnet algorithm to incorporate diagonal target matrices. These are summarized again in Algorithm 5. Note again that for very large entries in the target matrix (which are usually beyond what is reasonable from a statistical perspective) our chosen updates might face convergence issues when used with the gelnet algorithm. We did not experience convergence issues for any of the targets used in our simulations (see section 2.1). These reasonable targets are "conservative", i.e. they typically do not have overly large entries.
Algorithm 5: gelnet algorithm with diagonal target matrix T
1. Initialize Θ_init and W_init as described in the Initialization paragraph above.
2. Cycle around the columns repeatedly, performing the following steps until convergence:
 a: Rearrange the rows/columns so that the target column is last (implicitly).
 b: Solve the Elastic Net regression problem (9) with coordinate descent to get β̂. As warm start for β, use the solution from the previous round for this row/column.
 c: Update ŵ_12 = W_11 β̂ and ŵ_21 = ŵ_12^T.
 d: Calculate the test value F_t := F(t_22) as in the Updates paragraph to determine the case.
 e: Update θ̂_22 according to the case.
 f: Update θ̂_12 = −θ̂_22 β̂ and θ̂_21 = θ̂_12^T.
 g: Update ŵ_22 according to the case, e.g. for case 1: ŵ_22 = s_22 + λα + λ(1 − α)(θ̂_22 − t_22).
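Putting the pieces together, the outer loop of Algorithm 5 could be organized as in the skeleton below. This is deliberately only a sketch: solve_enet_subproblem() is a hypothetical placeholder (with an assumed signature) for the coordinate-descent solver of problem (9), which is specified earlier in the paper and not repeated here; update_diag() and init_gelnet_target() are the helper sketches above:

```r
# Skeleton of Algorithm 5, assuming 0 < alpha < 1 and a diagonal target.
gelnet_target_sketch <- function(S, T_diag, lambda, alpha,
                                 solve_enet_subproblem,
                                 tol = 1e-4, max_iter = 100) {
  p   <- ncol(S)
  ini <- init_gelnet_target(S, T_diag, lambda, alpha)
  Theta <- ini$Theta
  W     <- ini$W
  B <- matrix(0, p - 1, p)                    # warm starts for beta, per column
  for (iter in seq_len(max_iter)) {
    W_old <- W
    for (j in seq_len(p)) {                   # step a: column j last (implicitly)
      idx  <- setdiff(seq_len(p), j)
      beta <- solve_enet_subproblem(W[idx, idx], S[idx, j], lambda, alpha,
                                    warm = B[, j])           # step b
      B[, j] <- beta
      w12 <- as.vector(W[idx, idx] %*% beta)                 # step c
      W[idx, j] <- w12
      W[j, idx] <- w12
      upd <- update_diag(S[j, j], T_diag[j], sum(w12 * beta),
                         lambda, alpha)                      # steps d, e and g
      Theta[j, j] <- upd$theta22
      W[j, j]     <- upd$w22
      Theta[idx, j] <- -upd$theta22 * beta                   # step f
      Theta[j, idx] <- Theta[idx, j]
    }
    if (max(abs(W - W_old)) < tol) break                     # convergence check
  }
  list(Theta = Theta, W = W)
}
```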