Distributionally Robust Formulation and Model Selection for the Graphical Lasso
Pedro Cisneros-Velarde, Sang-Yun Oh, Alexander Petersen
Center for Control, Dynamical Systems and Computation; Department of Statistics and Applied Probability; University of California, Santa Barbara

Abstract
Building on a recent framework for distributionally robust optimization, we consider estimation of the inverse covariance matrix for multivariate data. We provide a novel notion of a Wasserstein ambiguity set specifically tailored to this estimation problem, leading to a tractable class of regularized estimators. Special cases include penalized likelihood estimators for Gaussian data, specifically the graphical lasso estimator. As a consequence of this formulation, the radius of the Wasserstein ambiguity set is directly related to the regularization parameter in the estimation problem. Using this relationship, the level of robustness of the estimation procedure can be shown to correspond to the level of confidence with which the ambiguity set contains a distribution with the population covariance. Furthermore, a unique feature of our formulation is that the radius can be expressed in closed form as a function of the ordinary sample covariance matrix. Taking advantage of this finding, we develop a simple algorithm to determine a regularization parameter for the graphical lasso, using only bootstrapped sample covariance matrices, so that computationally expensive repeated evaluation of the graphical lasso algorithm is not necessary. Alternatively, the distributionally robust formulation can also quantify the robustness of the corresponding estimator if one uses an off-the-shelf method such as cross-validation. Finally, we numerically study the obtained regularization criterion and analyze the robustness of other automated tuning procedures used in practice.
1 Introduction

In statistics and machine learning, the covariance matrix Σ of a random vector X ∈ R^d is a fundamental quantity for characterizing marginal pairwise dependencies between variables. Furthermore, the inverse covariance matrix Ω = Σ^{-1} provides information about the conditional linear dependency structure between the variables. For example, in the case that X is Gaussian, Ω_{jk} = 0 if and only if the j-th and k-th variables of X are conditionally independent given the rest. Such relationships are of interest in many applications such as environmental science, biology, and neuroscience (Guillot et al., 2015; Huang et al., 2010; Krumsiek et al., 2011), and have given rise to various statistical and machine learning methods for inverse covariance estimation.

Given an independent sample X_i ∼ X, i = 1, ..., n, the sample covariance A_n = n^{-1} ∑_{i=1}^n X_i X_i^⊤, which is the maximum likelihood estimator if X is Gaussian, can be a poor estimator of Σ unless d/n is very small. Driven by the so-called high-dimensional setting where n ≪ d, in which A_n is not invertible, regularized estimation of the precision matrix has gained significant interest (Cai et al., 2011; Edwards, 2010; Friedman et al., 2007; Khare et al., 2015; Won et al., 2012; Yuan and Lin, 2007). Such regularization procedures are useful even when A_n is a stable estimate (i.e., positive definite with small condition number), since the inverse covariance estimate A_n^{-1} is dense and will not reflect the sparsity of corresponding nonzero elements in Ω.

Distributionally Robust Optimization: Let S_d be the set of d × d symmetric matrices, and S_d^{++} ⊂ S_d be the subset of positive definite symmetric matrices. Given a loss function l(X; K) for K ∈ S_d^{++} and X ∈ R^d, a classical approach would be to estimate Ω by minimizing the empirical loss ∑_{i=1}^n l(X_i; K) over K ∈ S_d^{++}, perhaps including a regularization or penalty term. The error in this estimate arises from the discrepancy between the true data-generating distribution and the observed training samples, and can be assessed by various tools such as concentration bounds or rates of convergence. In contrast, distributionally robust optimization (DRO) is a technique that explicitly incorporates uncertainty about the distribution of the X_i into the estimation procedure. For an introduction to the general topic of DRO, we refer to the works (Shafieezadeh-Abadeh et al., 2015; Blanchet and Murthy, 2016; Shafieezadeh-Abadeh et al., 2019) and the references therein. In the context of inverse covariance estimation, a distributionally robust estimate of Ω is obtained by solving

    inf_{K ∈ S_d^{++}} sup_{P ∈ S} E_P[l(X; K)],    (1)

where S, known as an ambiguity set, is a collection of probability measures on R^d. As pointed out in (Blanchet and Murthy, 2016), a natural choice for S is the neighborhood {P | D(P, μ) ≤ δ}, with μ being a chosen baseline model, δ being some tolerance level which defines the uncertainty size of the ambiguity set, and D being some discrepancy metric between two probability measures. In a practical setting, we have access to some samples (or data points) from the unknown distribution, and thus a good candidate for the baseline model is the empirical measure.

Very recent work by Nguyen et al. (2018), also analyzed by Blanchet and Si (2019), used the DRO framework to construct a new regularized (dense) inverse covariance estimator. Working under the assumption that X is Gaussian, the authors construct an ambiguity set S of Gaussian distributions that, up to a certain tolerance level, are consistent with the observed data. Recent work on DRO in other machine learning problems has revealed explicit connections to well-known regularized estimators, specifically regularized logistic regression (Shafieezadeh-Abadeh et al., 2015) and the square-root lasso for linear models (Blanchet et al., 2016); however, such a connection to regularized sparse inverse covariance estimators that are used in practice has yet to be made.

The Graphical Lasso: One of the most common methods to recover the sparsity pattern in Ω is to add an ℓ1-regularization term to the Gaussian likelihood function, motivated by the consideration of the Gaussian Graphical Model (GGM). A sparse estimate of Ω = Σ^{-1} is produced by minimizing

    L_λ(K) = (1/n) ∑_{i=1}^n X_i^⊤ K X_i − log|K| + λ ∑_{i=1}^d ∑_{j=1}^d |k_{ij}|,    (2)

where k_{ij} is the (i, j) entry of K and λ > 0 is a regularization parameter; the minimizer of (2) is the graphical lasso estimator (Friedman et al., 2007). The first two terms of (2) are related to Stein's loss (James and Stein, 1961) when evaluated at the empirical measure, and also correspond to the negative log-likelihood up to an additive constant if X is Gaussian. The performance of the graphical lasso estimator in high-dimensional settings has been investigated (Rothman et al., 2008; Jankova and van de Geer, 2018), as well as modifications and extensions that implement some notion of robustness, i.e., for making it robust to outliers or relaxing the normality assumptions on the data (Khare et al., 2015; Lam and Fan, 2009; Loh and Tan, 2018; Xue and Zou, 2012; Yang and Lozano, 2015).

Besides its theoretical relevance, the graphical lasso and its extensions also enjoy many practical advantages. For example, it has been used as a network inference tool. In these applications, the precision matrix can indicate which nodes in a network are conditionally independent given information from the remaining nodes, thus giving an indication of network functionality. This has been important in neuroscience applications when studying the inference of brain connectivity (Yang et al., 2015; Smith et al., 2011; Huang et al., 2010). Applications in gene regulatory networks and metabolomics have also been reported (Menéndez et al., 2010; Sulaimanov et al., 2018; Krumsiek et al., 2011).

The performance of the graphical lasso estimator hinges critically on the choice of λ. While there have been studies on how to properly tune λ to obtain a consistent estimator or to establish correct detection of nonzero elements in the precision matrix (Rothman et al., 2008; Banerjee et al., 2008; Mazumder and Hastie, 2012), in practice this selection is often made through automated methods like cross-validation.

Contributions: In this paper, we propose a distributionally robust reformulation of the graphical lasso estimator in (2). Following Shafieezadeh-Abadeh et al. (2015); Blanchet et al. (2016); Esfahani and Kuhn (2018), we utilize the Wasserstein metric to quantify distributional uncertainty for the construction of the ambiguity set. The following points summarize our main contributions.

• We formulate a class of DRO problems for inverse covariance estimation, leading to a tractable class of ℓ_p-norm regularized estimators. As the graphical lasso estimator (2) is a special case, this provides a new interpretation of this popular technique. This DRO formulation is made possible by a novel type of ambiguity set, now defined as a collection of measures on matrices. This nontrivial adaptation is necessary because a direct generalization of other DRO approaches using vector-valued data (e.g., Shafieezadeh-Abadeh et al. (2019)) does not result in a closed-form regularization problem, and thus does not provide the desired connection to the graphical lasso.

• We use this formulation to suggest a criterion for the selection of the regularization parameter in the estimation problem in the classical regime n > d. This criterion follows the Robust Wasserstein Profile (RWP) inference recently introduced by Blanchet et al. (2016), which makes no assumption on the normality of the data, and which we tailor to our specific problem. The proposed criterion expresses the regularization parameter as an explicit function of the sample covariance A_n, unlike other instances where RWP has been implemented, which rely on stochastic dominance arguments.

• We formulate a novel robust selection (RobSel) algorithm for regularization parameter choice. Focusing on the graphical lasso, we provide numerical results that compare the performance of cross-validation and our proposed algorithm for the selection of the regularization term.

The paper is organized as follows. In Section 2 we describe our main theoretical result: the distributionally robust formulation of regularized inverse covariance (log-likelihood) estimation, of which the graphical lasso is a particular instance. In Section 3 we propose a criterion for choosing the regularization parameter inspired by this formulation and outline the bootstrap-based RobSel algorithm for its computation. In Section 4 we present numerical results comparing the proposed criterion of Section 3 with cross-validation. Finally, we state some concluding remarks and future research directions in Section 5. All proofs of theoretical results can be found in the supplementary material.
2 Distributionally Robust Formulation

First, we provide preliminary details on notation. Given a matrix A ∈ R^{d×d}, a_{jk} denotes its (j, k) entry and vec(A) ∈ R^{d²} denotes its vectorized form, which we assume to be in row-major order. For matrices denoted by Greek letters, entries are simply denoted by appropriate subscripts, e.g. Σ_{jk}. The operator |·|, when applied to a matrix, denotes its determinant; when applied to a scalar or a vector, it denotes the absolute value or entry-wise absolute value, respectively. The ℓ_p-norm of a vector is denoted by ‖·‖_p. We use the symbol ⇒ to denote convergence in distribution.

Recall that X ∈ R^d is a zero-mean random vector with covariance matrix Σ ∈ S_d^{++}. Let Q be the probability law of X and Ω = Σ^{-1} be the precision matrix. Define the graphical loss function as

    l(X; K) = X^⊤ K X − log|K| = trace(K X X^⊤) − log|K|.    (3)

Then E_Q[l(X; K)] = trace(K Σ) − log|K| is a convex function of K over the convex cone S_d^{++}. Using the first-order optimality criterion, we observe that K = Ω sets the gradient (∂/∂K) E_Q[l(X; K)] = Σ − K^{-1} equal to the zero matrix (see (Boyd and Vandenberghe, 2004, Appendix A) for details on this differentiation). Hence,

    argmin_{K ∈ S_d^{++}} E_Q[l(X; K)] = Ω,

so that (3) is a consistent loss function.

Now, if we consider an iid random sample X_1, ..., X_n ∼ X, n > d, with empirical measure Q_n, then

    argmin_{K ∈ S_d^{++}} E_{Q_n}[l(X; K)] = argmin_{K ∈ S_d^{++}} (1/n) ∑_{i=1}^n l(X_i; K) = A_n^{-1},

with A_n = n^{-1} ∑_{i=1}^n X_i X_i^⊤. Thus, as described in the Introduction, a natural approach would be to implement the DRO procedure outlined in Esfahani and Kuhn (2018) by building an ambiguity set based on perturbations of Q_n, leading to the DRO estimate given by (1). However, this approach does not convert (1) into a regularized estimation problem as desired, since the inner supremum cannot be given explicitly in closed form. For more details, see Section A in the supplementary material.

As an alternative, let P represent the measure of the random matrix W = X X^⊤ on S_d induced by Q and, similarly, let P_n be the empirical measure of the sample W_i = X_i X_i^⊤, i = 1, ..., n. Redefining the graphical loss function l : S_d × S_d^{++} → R as

    l(W; K) = trace(K W) − log|K|,    (4)

we then have

    Ω = argmin_{K ∈ S_d^{++}} E_P[l(W; K)],    A_n^{-1} = argmin_{K ∈ S_d^{++}} E_{P_n}[l(W; K)].

This observation leads to a tractable DRO formulation by constructing ambiguity sets built around the empirical measure P_n. The DRO formulation for inverse covariance estimation becomes

    min_{K ∈ S_d^{++}} sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)].    (5)

The ambiguity set in this formulation is specified by the collection of measures {P | D_c(P, P_n) ≤ δ}, which we now describe. Given two probability distributions P_1 and P_2 on S_d and some transportation cost function c : S_d × S_d → [0, ∞) (which we will specify below), we define the optimal transport cost between P_1 and P_2 as

    D_c(P_1, P_2) = inf { E_π[c(U, V)] | π ∈ P(S_d × S_d), π_U = P_1, π_V = P_2 },    (6)

where P(S_d × S_d) is the set of joint probability distributions π of (U, V) supported on S_d × S_d, and π_U and π_V denote the marginals of U and V under π, respectively. In this paper, we are interested in cost functions

    c(U, V) = ‖vec(U) − vec(V)‖_q^ρ,    (7)

with U, V ∈ S_d, ρ ≥ 1, and q ∈ [1, ∞]. As pointed out by Blanchet et al. (2016), the resulting optimal transport cost D_c^{1/ρ} is the Wasserstein distance of order ρ.
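As a concrete illustration of these objects (not part of the paper's own code), the following sketch forms the matrices W_i = X_i X_i^⊤, the sample covariance A_n, and evaluates the graphical loss (4); the data array X and helper names are hypothetical, and NumPy is assumed:

```python
import numpy as np

def graphical_loss(W, K):
    """Graphical loss l(W; K) = trace(K W) - log|K| from (4)."""
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        raise ValueError("K must be positive definite")
    return np.trace(K @ W) - logdet

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d))        # placeholder zero-mean sample, one row per X_i
W = np.einsum("ij,ik->ijk", X, X)      # W_i = X_i X_i^T, shape (n, d, d)
A_n = W.mean(axis=0)                   # A_n = (1/n) sum_i X_i X_i^T

# For n > d, the empirical graphical loss is minimized at K = A_n^{-1}.
K_hat = np.linalg.inv(A_n)
empirical_loss = np.mean([graphical_loss(Wi, K_hat) for Wi in W])
```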
Our first theoretical result demonstrates that the optimization in (5) corresponds to a class of regularized estimators under the graphical loss function (4).

Theorem 2.1 (DRO formulation of regularized inverse covariance estimation). Consider the cost function in (7) for a fixed ρ ≥ 1. Then,

    min_{K ∈ S_d^{++}} sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] = min_{K ∈ S_d^{++}} { trace(K A_n) − log|K| + δ^{1/ρ} ‖vec(K)‖_p },    (8)

where 1/p + 1/q = 1.

Theorem 2.1 provides an explicit mapping between the regularization parameter and the uncertainty size δ of the ambiguity set in the DRO formulation. The regularization problem then reduces to determining a good criterion for choosing δ, which we explore in Section 3. Moreover, we obtain the graphical lasso formulation (2) by setting q = ∞ in (7). From (8), a smaller ambiguity set implies less robustness being introduced in the estimation problem by reducing the importance of the regularization term. Conversely, a larger regularization term increases the number of nuisance distributions inside the ambiguity set, and thus the robustness.
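For the graphical lasso case q = ∞, p = 1, the mapping λ = δ^{1/ρ} can be used with any off-the-shelf solver. The sketch below is illustrative only and uses scikit-learn's graphical_lasso (an assumption about available tooling, not the software used in the paper); note that scikit-learn applies the ℓ1 penalty only to the off-diagonal entries of K, a minor variant of (2):

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def dro_precision_estimate(X, delta, rho=1.0):
    """Solve the right-hand side of (8) for p = 1 (graphical lasso),
    with the Theorem 2.1 mapping lambda = delta**(1/rho)."""
    n = X.shape[0]
    A_n = X.T @ X / n                      # sample covariance of zero-mean data
    lam = delta ** (1.0 / rho)             # regularization parameter from the radius
    _, K_hat = graphical_lasso(A_n, alpha=lam)
    return K_hat

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 10))
K_small_ball = dro_precision_estimate(X, delta=0.01)   # small ambiguity set -> denser estimate
K_large_ball = dro_precision_estimate(X, delta=0.20)   # larger ambiguity set -> sparser estimate
```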
Remark 2.2. The ambiguity set used in (8) makes no assumptions on the normality of the distribution of the samples {X_1, ..., X_n}. Thus, (8) tells us that adding a penalty on the precision matrix confers robustness with respect to the distributions that the samples may have, which do not necessarily have to be Gaussian for this formulation to hold. Furthermore, it holds independently of the relationship between n and d.

3 Robust Selection of the Regularization Parameter

This section follows closely the line of thought recently introduced by Blanchet et al. (2016) in the analysis of regularized estimators under the DRO formulation. Specifically, we will demonstrate that the ambiguity set {P : D_c(P, P_n) ≤ δ} represents a confidence region for Ω = Σ^{-1}, and use the techniques of Blanchet et al. (2016) to explicitly connect the ambiguity size δ with a confidence level. As previously stated, l(W; K) is a differentiable function of K ∈ S_d^{++} with (∂/∂K) l(W; K) = W − K^{-1}, so that

    E_P[ (∂/∂K) l(W; K) |_{K=Ω} ] = 0_{d×d}.    (9)

Hence, even though the loss function l(W; K) is inspired by the log-likelihood estimation of the covariance matrix Σ for samples of Gaussian random vectors, equation (9) is transparent to any underlying distribution of the data. For any K ∈ S_d^{++}, define the set

    O(K) := { P ∈ P(S_d) | E_P[ (∂/∂K') l(W; K') |_{K'=K} ] = 0_{d×d} },    (10)

corresponding to all probability measures with covariance K^{-1}, i.e. for which K is an optimal loss minimization parameter; here, P(S_d) denotes the set of all probability distributions supported on S_d. Thus, O(Ω) contains all probability measures with covariance matrix agreeing with that of X.

Implicitly, the Wasserstein ambiguity set {P : D_c(P, P_n) ≤ δ} is linked to the collection of matrices

    C_n(δ) := { K ∈ S_d^{++} | there exists P ∈ O(K) ∩ {P | D_c(P, P_n) ≤ δ} } = ∪_{P : D_c(P, P_n) ≤ δ} argmin_{K ∈ S_d^{++}} E_P[l(W; K)].    (11)

We refer to C_n(δ) as the set of plausible selections for Ω.

Lemma 3.1 (Interchangeability in the DRO formulation). Consider the setting of Theorem 2.1. Then, for n > d, the following holds with probability one:

    inf_{K ∈ S_d^{++}} sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] = sup_{P : D_c(P, P_n) ≤ δ} inf_{K ∈ S_d^{++}} E_P[l(W; K)].    (12)

Lemma 3.1 states that any estimator obtained by minimizing the left-hand side of (12) must be in C_n(δ); otherwise the right-hand side of (12) would be strictly greater than the left. Thus, in line with the goal of providing a robust estimator, the idea is to choose δ so that C_n(δ) also contains the true inverse covariance matrix Ω with high confidence.

As P is the weak limit of P_n, we will eventually have that Ω ∈ C_n(δ) with high probability for any δ, so that C_n(δ) is a confidence region for Ω. From this observation, we can choose the uncertainty size δ optimally by the criterion

    δ = inf { δ > 0 | P(Ω ∈ C_n(δ)) ≥ 1 − α },    (13)

i.e., for a specified confidence level 1 − α, we choose δ so that C_n(δ) is a (1 − α)-confidence region for Ω. To continue our analysis, we make use of the so-called Robust Wasserstein Profile (RWP) function R_n introduced by Blanchet et al. (2016),

    R_n(K) = inf { D_c(P, P_n) | P ∈ O(K) } = inf { D_c(P, P_n) | E_P[ (∂/∂K') l(W; K') |_{K'=K} ] = 0_{d×d} },    (14)

for K ∈ S_d^{++}, which has the geometric interpretation of being the minimum distance between the empirical distribution and any distribution that satisfies the optimality condition for the precision matrix K. Then, using the equivalence of events

    {Ω ∈ C_n(δ)} = { O(Ω) ∩ {P | D_c(P, P_n) ≤ δ} ≠ ∅ } = { R_n(Ω) ≤ δ },

(13) becomes equivalent to

    δ = arg inf { δ > 0 | P(R_n(Ω) ≤ δ) ≥ 1 − α },    (15)

i.e., the optimal selection of δ is the (1 − α) quantile of R_n(Ω). Indeed, the set {P | D_c(P, P_n) ≤ R_n(Ω)} is the smallest ambiguity set around the empirical measure P_n such that there exists a distribution for which Ω is an optimal loss minimization parameter. In contrast to previously reported applications of the RWP function to linear regression and logistic regression (Blanchet et al., 2016), our problem allows for a (finite sample) closed-form expression of this function. This is due to the fact that we have recast the covariance Σ as the mean of the random matrix XX^⊤, so that the following result gives a nontrivial generalization of (Blanchet et al., 2016, Example 3).

Theorem 3.2 (RWP function). Consider the cost function in (7) for a fixed ρ ≥ 1. For K ∈ S_d^{++}, consider R_n(K) as in (14). Then,

    R_n(K) = ‖vec(A_n − K^{-1})‖_q^ρ.    (16)

We now establish important convergence guarantees on the RWP function in the following corollary.

Corollary 3.3 (Asymptotic behavior of the RWP function). Suppose that the conditions of Theorem 3.2 hold, and that E_Q(‖X‖⁴) < ∞. Let H ∈ S_d be a matrix of jointly Gaussian random variables with zero mean and such that Cov(h_{ij}, h_{kℓ}) = E[w_{ij} w_{kℓ}] − Σ_{ij} Σ_{kℓ} = E[x_i x_j x_k x_ℓ] − Σ_{ij} Σ_{kℓ}. Then,

    n^{ρ/2} R_n(Ω) ⇒ ‖vec(H)‖_q^ρ.    (17)
Proof. By the central limit theorem, we observe that √n (A_n − Σ) ⇒ H, and by the continuous mapping theorem, we get that n^{ρ/2} R_n(Ω) = ‖√n vec(A_n − Σ)‖_q^ρ ⇒ ‖vec(H)‖_q^ρ.

Remark 3.4. Turning our attention back to Theorem 2.1, a robust selection for the ambiguity size, or equivalently the regularization parameter λ = δ^{1/ρ}, as obtained from Theorem 3.2, is

    δ^{1/ρ} = inf { t > 0 | P(‖vec(A_n − Σ)‖_q ≤ t) ≥ 1 − α }.    (18)

As a result, this robust selection for λ results in a class of estimators, given by minimizers of the right-hand side of (8), that are invariant to the choice of ρ in (7). Thus, for simplicity, we will set ρ = 1 in the remainder of the paper.
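Because of the closed form (16), the RWP function can be evaluated directly from the sample covariance. The following sketch (hypothetical helper names, assuming NumPy) computes it for a candidate precision matrix K:

```python
import numpy as np

def rwp(A_n, K, q=np.inf, rho=1.0):
    """RWP function in closed form (Theorem 3.2): ||vec(A_n - K^{-1})||_q ** rho."""
    resid = (A_n - np.linalg.inv(K)).ravel()
    if np.isinf(q):
        norm = np.abs(resid).max()
    else:
        norm = (np.abs(resid) ** q).sum() ** (1.0 / q)
    return norm ** rho

# With q = inf (the graphical lasso case p = 1), the oracle choice (18) is the
# (1 - alpha) quantile of ||vec(A_n - Sigma)||_inf over repeated samples; the RobSel
# algorithm below replaces the unknown Sigma by a bootstrap analogue of this quantity.
```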
Remark 3.5. Let r_{1−α} be the (1 − α) quantile of the distribution of the right-hand side of (17). Then, for any fixed α, the robust selection δ in (18) satisfies n^{ρ/2} δ → r_{1−α}, so that the optimal decay rate of n^{−1/2} for λ is automatically chosen by the RWP function.

As solving (18) requires knowledge of Σ, we now outline the robust selection (RobSel) algorithm for a data-adaptive choice of the regularization parameter δ for inverse covariance estimation with an ℓ_p penalization term. The special case p = 1 corresponds to the graphical lasso in (2), in which case we will also use the notation δ = λ. The asymptotic result in Corollary 3.3 invokes a central limit theorem, and thus motivates the approximation of the RWP function through bootstrapping, whose numerical performance we further explain and evaluate in the next section. Let α ∈ (0, 1) be a prespecified confidence level and B a large integer such that (B + 1)(1 − α) is also an integer.

Algorithm 1 (RobSel algorithm for estimation of the regularization parameter λ):
1. For b = 1, ..., B, obtain a bootstrap sample X*_{1b}, ..., X*_{nb} by sampling uniformly and with replacement from the data, and compute the bootstrap RWP function R*_{n,b} = ‖vec(A*_{n,b} − A_n)‖_q, with the empirical covariance A*_{n,b} computed from the bootstrap sample.
2. Set λ to be the bootstrap order statistic R*_{n,((B+1)(1−α))}.
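A compact sketch of Algorithm 1 follows (hypothetical function name, assuming NumPy and zero-mean data stored row-wise in X; q = ∞ corresponds to the graphical lasso case p = 1):

```python
import numpy as np

def robsel(X, alpha=0.9, B=200, q=np.inf, rho=1.0, seed=0):
    """RobSel estimate of the regularization parameter lambda (Algorithm 1).

    Returns the ((B + 1)(1 - alpha))-th order statistic of the bootstrap
    RWP values R*_{n,b} = ||vec(A*_{n,b} - A_n)||_q ** rho.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A_n = X.T @ X / n                                   # sample covariance (zero-mean data)
    R_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                # sample uniformly with replacement
        Xb = X[idx]
        resid = (Xb.T @ Xb / n - A_n).ravel()
        norm = np.abs(resid).max() if np.isinf(q) else (np.abs(resid) ** q).sum() ** (1 / q)
        R_star[b] = norm ** rho
    k = min(int(np.ceil((B + 1) * (1 - alpha))), B)     # order-statistic index (1-based)
    return np.sort(R_star)[k - 1]

# Example use: lam = robsel(X, alpha=0.9, B=200); the graphical lasso is then fit
# once with this lam, instead of being refit repeatedly as in cross-validation.
```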
RobSel can potentially provide considerable computational savings over cross-validation in practice. Computing the sample covariance matrix for each of the B bootstrap samples has cost O(Bnd²). On the other hand, it is known that each iteration of the graphical lasso can cost O(d³) in the worst case (Mazumder and Hastie, 2012); therefore, performing F-fold cross-validation to search over an L-point grid of regularization parameters, each fit taking T iterations of the graphical lasso, would cost O(FLTd³).

4 Numerical Results

The true precision matrix Ω ∈ S_d^{++} used to generate simulated data has been constructed as follows. First, generate an adjacency matrix of an undirected Erdős–Rényi graph with equal edge probability and without self-loops. Then, the weight of each edge is sampled uniformly from a fixed interval; the dimension d = 100 is fixed. Using this Ω, a total of N = 200 datasets (of varying size n) were generated as independent observations from a multivariate zero-mean Gaussian distribution, i.e., N(0_d, Ω^{-1}).

Consider the problem of choosing the regularization parameter λ (equivalently, the ambiguity size parameter δ) to obtain graphical lasso estimates K̂_λ of Ω using the simulated datasets. The R software package glasso from CRAN was used throughout our numerical experiments. Below, we compare two different criteria for choosing λ. The first criterion is Robust Selection (RS), which follows our proposed RobSel algorithm with B = 200 sets of bootstrap samples. We present here results mainly for n > d, but additional results in the high-dimensional regime n < d can be found in the supplementary material. The second criterion is a 5-fold cross-validation (CV) procedure. The performance on the validation set is the evaluation of the graphical loss function under the empirical measure of the samples on the training set.

Recall the elements of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). We compare the model selection performance of λ chosen by the two different approaches, λ_RS and λ_CV, using the following comparison metrics:

• True positive rate (TPR) and false detection rate (FDR): TPR = TP/(TP + FN) is the proportion of nonzero entries of Ω that are correctly identified in K̂_λ, and FDR = FP/(TP + FP) is the proportion of entries identified as nonzero in K̂_λ that are actually zero in Ω.

• Matthews correlation coefficient (MCC): MCC summarizes all counts in the confusion matrix, in contrast to measures like TPR and FDR. More details about MCC are given in supplemental subsection D.1.

In the remainder of this section, we compare the model selection performance (FDR, TPR, MCC) from our simulation results. As mentioned in Remark 3.5, supplemental subsection D.2 shows that λ_RS decreases as n increases, as is also observed for λ_CV. Furthermore, across the tested range of α, the values of λ_RS are all larger than λ_CV for any n. Our distributionally robust representation (8) thus allows us to observe that, even for small values of n, CV always chooses a λ that corresponds to smaller ambiguity sets than RS.

To assess the accuracy of RobSel in estimating λ = δ for a given α, we approximated the right-hand side of (18) using the N = 200 data sets and the true covariance Σ, giving the "true" value λ_RWP. Figures in supplemental subsection D.3 show that the performance obtained by λ_RWP is similar to the one obtained by λ_RS for all comparison metrics. This finite sample behavior of RS indicates that the RobSel bootstrap algorithm reliably approximates the desired robustness level corresponding to the choice of α. These plots also indicate that the RS criterion is more conservative than CV, achieving a lower FDR across different sample sizes due to providing larger values of λ (see supplemental subsection D.2). More specifically, Fig. 1 shows that RS gives a better performance than CV in terms of FDR even for smaller values of n, and this performance improves further as n increases.

Moreover, the trade-off between the preference for robustness and the preference for a denser estimation of nonzero entries in the precision matrix can be observed in terms of the Matthews correlation coefficient (MCC), as shown in Fig. 2. Higher values in the MCC curve correspond to values of λ that give a better classification of the entries of Ω as either zero or nonzero (see supplemental subsection D.1 for more details). We observe that CV is placed to the left of the optimal value of the MCC and its under-performance is due to the overestimation of nonzero entries in Ω. On the other hand, RS is to the right of the optimal value and its under-performance is due to its conservative overestimation of zero entries in Ω, induced by its robust nature. It is therefore up to the experimenter to decide which method to use, depending on whether she desires to control the overestimation of zero or of nonzero entries. Remarkably, for large values of n, RS is much closer to the optimal performance than CV according to the MCC, and it achieves this by maintaining a lower FDR than CV while increasing its TPR.

Our results from the MCC analysis and supplemental subsection D.3 also indicate that we should aim for higher values of α if we want performance closer to CV in terms of TPR when using the RS criterion, with the advantage of still maintaining a better FDR than CV. In contrast, if we want more conservative results, we should aim for lower values of α. This is a good property of RS: it allows the use of a single parameter α ∈ (0, 1) to adjust the importance of the regularization term. As mentioned before Section 4, a further practical advantage of RS is that it provides a candidate for λ with potentially considerable computational savings over CV.

5 Conclusion

We provide a recharacterization of the popular graphical lasso estimator in (2) as the optimal solution to a distributionally robust optimization problem. To the best of our knowledge, this is the first work to make such a connection for sparse inverse covariance estimation. The DRO form of the estimator leads to a reinterpretation of the regularization parameter as the radius of the distributional ambiguity set, which can be chosen based on a desired level of robustness. We propose the RobSel method for the selection of the regularization parameter and compare it to cross-validation for the graphical lasso. In our numerical experiments, RobSel gives a better false detection rate than cross-validation, and, as the sample size increases, other performance metrics like the true positive rate for the two are similar. Moreover, RobSel is a computationally simpler procedure, notably performing the graphical lasso algorithm only once at the final step rather than repeatedly as is necessary for cross-validation. Future work includes theoretical justification for robust selection of the regularization parameter for the graphical lasso in the high-dimensional setting.
References
O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihoodestimation for multivariate Gaussian or binary data.
J. Mach. Learn. Res., 9:485–516, 2008. ISSN 1532-4435.

J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. 2016. doi:arXiv:1604.01446v2. URL https://arxiv.org/pdf/1604.01446.pdf.

J. Blanchet and N. Si. Optimal uncertainty size in distributionally robust inverse covariance estimation. 2019. doi:arXiv:1901.07693. URL https://arxiv.org/pdf/1901.07693.pdf.

J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. 2016. doi:arXiv:1610.05627v2. URL https://arxiv.org/pdf/1610.05627.pdf.

S. Boyd and L. Vandenberghe.
Convex Optimization. Cambridge University Press, 2004. ISBN 0521833787.

T. Cai, W. Liu, and X. Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, jun 2011. doi:10.1198/jasa.2011.tm10155.

D. Chicco. Ten quick tips for machine learning in computational biology.
BioData Mining , 10(1):35, Dec2017. doi:10.1186/s13040-017-0155-3.D. Edwards.
Introduction to Graphical Modelling . Springer-Verlag New York, 2010. ISBN 978-0-387-95054-9. doi:10.1007/978-1-4612-0493-0.P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wassersteinmetric: Performance guarantees and tractable reformulations.
Mathematical Programming , 171(1-2):115–166, 2018.J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso.
Biostatistics, 9(3):432–441, 12 2007. ISSN 1465-4644. doi:10.1093/biostatistics/kxm045.

D. Guillot, B. Rajaratnam, and J. Emile-Geay. Statistical paleoclimate reconstructions via Markov random fields.
Ann. Appl. Stat. , 9(1):324–352, 03 2015. doi:10.1214/14-AOAS794.C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Quic: Quadratic approximation for sparseinverse covariance estimation.
Journal of Machine Learning Research , 15:2911–2947, 2014.S. Huang, J. Li, L. Sun, J. Ye, A. Fleisher, T. Wu, K. Chen, and E. Reiman. Learning brain connectivityof Alzheimer’s disease by sparse inverse covariance estimation.
NeuroImage, 50(3):935–949, 2010. ISSN 1053-8119.

K. Isii. On sharpness of Tchebycheff-type inequalities.
Annals of the Institute of Statistical Mathematics ,14(1):185–197, 1962. doi:10.1007/BF02868641.W. James and C. Stein. Estimation with quadratic loss. In
Proceedings of the Fourth Berkeley Symposiumon Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics , pages361–379, Berkeley, Calif., 1961. University of California Press.J. Jankova and S. van de Geer. Inference in high-dimensional graphical models. 2018.doi:arXiv:1801.08512. URL https://arxiv.org/pdf/1801.08512.pdf .K. Khare, S.-Y. Oh, and B. Rajaratnam. A convex pseudolikelihood framework for high dimensionalpartial correlation estimation with convergence guarantees.
Journal of the Royal Statistical Society:Series B (Statistical Methodology) , 77(4):803–825, sep 2015. doi:10.1111/rssb.12088.J. Krumsiek, K. Suhre, T. Illig, J. Adamski, and F. J. Theis. Gaussian graphical modeling reconstructspathway reactions from high-throughput metabolomics data.
BMC Systems Biology , 5(1):21, Jan 2011.ISSN 1752-0509. doi:10.1186/1752-0509-5-21.C. Lam and J. Fan. Sparsity and rates of convergence in large covariance matrix estimation.
The Annals of Statistics, 37(6B):4254–4278, 2009.

P.-L. Loh and X. L. Tan. High-dimensional robust precision matrix estimation: Cellwise corruption under ε-contamination.
Electron. J. Statist. , 12(1):1429–1467, 2018. doi:10.1214/18-EJS1427.R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scalegraphical lasso.
J. Mach. Learn. Res. , 13(1):781–794, 2012. ISSN 1532-4435.P. Men´endez, Y. Kourmpetis, C. ter Braak, and F. van Eeuwijk. Gene regulatory networks from multi-factorial perturbations using graphical lasso: Application to the dream4 challenge.
PLOS ONE , 5(12):1–8, 12 2010. doi:10.1371/journal.pone.0014147.V. A. Nguyen, D. Kuhn, and P. M. Esfahani. Distributionally robust inverse covariance estimation: TheWasserstein shrinkage estimator. 2018. doi:arXiv:1805.07194. URL https://arxiv.org/pdf/1805.07194.pdf .J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models.
Journal of the American Statistical Association , 104(486):735–746, 2009. doi:10.1198/jasa.2009.0126.D. M. Powers. Evaluation: From precision, recall and f-factor to roc, informedness, markedness &correlation.
Journal of Machine Learning Technologies , 2:37–63, 2011.B. Rolfs, B. Rajaratnam, D. Guillot, I. Wong, and A. Maleki. Iterative thresholding algorithm for sparseinverse covariance estimation. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger,editors,
Advances in Neural Information Processing Systems 25 , pages 1574–1582. Curran Associates,Inc., 2012.A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation.
Electron. J. Statist. , 2:494–515, 2008. doi:10.1214/08-EJS176.S. Shafieezadeh-Abadeh, P. M. Esfahani, and D. Kuhn. Distributionally robust logistic regression. In
Advances in Neural Information Processing Systems 28 , pages 1576–1584. 2015.S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani. Regularization via mass transportation.
Journal of Machine Learning Research, 20, 2019.

S. M. Smith, K. L. Miller, G. Salimi-Khorshidi, M. Webster, C. F. Beckmann, T. E. Nichols, J. D. Ramsey, and M. W. Woolrich. Network modelling methods for fMRI.
NeuroImage , 54(2):875 – 891,2011. ISSN 1053-8119.J. M. Steele.
The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities .Cambridge University Press, 2004. doi:10.1017/CBO9780511817106.N. Sulaimanov, S. Kumar, H. Koeppl, F. Burdet, M. Pagni, and M. Ibberson. Inferring gene expressionnetworks with hubs using a degree weighted Lasso approach.
Bioinformatics , 35(6):987–994, 08 2018.ISSN 1367-4803. doi:10.1093/bioinformatics/bty716.J.-H. Won, J. Lim, S.-J. Kim, and B. Rajaratnam. Condition-number-regularized covariance estimation.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 75(3):427–450, dec 2012.doi:10.1111/j.1467-9868.2012.01049.x.L. Xue and H. Zou. Regularized rank-based estimation of high-dimensional paranormal graphical models.
The Annals of Statistics , 40(5):2541–2571, 2012.E. Yang and A. C. Lozano. Robust Gaussian graphical modeling with the trimmed graphical lasso. In
Advances in Neural Information Processing Systems 28 , pages 2602–2610. 2015.S. Yang, Q. Sun, S. Ji, P. Wonka, I. Davidson, and J. Ye. Structural graphical lasso for learning mousebrain connectivity. In
Proceedings of the 21th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining . ACM, 2015. doi:10.1145/2783258.2783391.M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model.
Biometrika, 94(1):19–35, 2007. doi:10.1093/biomet/asm018.

Figure 1: False detection rate (FDR) for seven different sample sizes (n = 50, 75, 200, 600, 1000, 2500, 10000), with both axes on a logarithmic scale. For each sample size, the average FDR is plotted for both criteria, cross-validation (CV) and RobSel (RS). For RS, a point is plotted with a different symbol for each value of the parameter α ∈ {0.10, 0.21, 0.33, 0.44, 0.56, 0.67, 0.79, 0.90} (some points may not be plotted for lower values of α, since those values gave no true positives detected, and so FDR was not well-defined).

Figure 2: Matthews correlation coefficient (MCC) for three different sample sizes (n = 200, 1000, 10000). The curves in the upper plot are the average MCC obtained over N = 200 datasets. The lower plot shows boxplots for the CV and RS (with parameter α = 0.9) methods for the different choices of n.

A Using a different ambiguity set
Recall that X ∈ R^d is a zero-mean random vector with covariance matrix Σ ∈ S_d^{++} and measure Q, and that we consider an iid random sample X_1, ..., X_n ∼ X, n > d, with empirical measure Q_n. Since we use the graphical loss function as in equation (3), we are interested in finding a tractable or closed-form expression for the optimization problem

    sup_{Q : D'_{c'}(Q, Q_n) ≤ δ} E_Q[l(X; K)],    (19)

with K ∈ S_d^{++}. Ideally, the solution should be connected to the graphical lasso estimator, since it is one of the most commonly used sparse inverse covariance estimators in practice. The ambiguity set in this formulation is specified by the collection of measures {Q | D'_{c'}(Q, Q_n) ≤ δ}, which we now describe. Given two probability distributions Q_1 and Q_2 on R^d and some transportation cost function c' : R^d × R^d → [0, ∞) (which we will specify below), we define the optimal transport cost between Q_1 and Q_2 as

    D'_{c'}(Q_1, Q_2) = inf { E_π[c'(u, v)] | π ∈ P(R^d × R^d), π_u = Q_1, π_v = Q_2 },    (20)

where P(R^d × R^d) is the set of joint probability distributions π of (u, v) supported on R^d × R^d, and π_u and π_v denote the marginals of u and v under π, respectively. We are interested in cost functions

    c'(u, v) = ‖u − v‖_q^ρ,    (21)

with u, v ∈ R^d, ρ ≥ 1, and q ∈ [1, ∞].

Now, observe that the function l(·; K) : R^d → R is Borel measurable since it is a continuous function. Then, we use the duality result from Proposition 4 of (Blanchet et al., 2016, version 2) and obtain

    sup_{Q : D'_{c'}(Q, Q_n) ≤ δ} E_Q[l(X; K)] = inf_{γ ≥ 0} { γδ + (1/n) ∑_{i=1}^n sup_{u ∈ R^d} { l(u; K) − γ c'(u, X_i) } }.    (22)

Let ∆ := u − X_i. Then

    sup_{u ∈ R^d} { l(u; K) − γ c'(u, X_i) } = sup_{u ∈ R^d} { u^⊤ K u − log|K| − γ ‖u − X_i‖_q^ρ } = sup_{∆ ∈ R^d} { (∆ + X_i)^⊤ K (∆ + X_i) − γ ‖∆‖_q^ρ } − log|K|.    (23)

Replacing this expression back in (22), it may be difficult, if not impossible, to obtain a closed-form optimization problem over K. Even if such a simplification were possible, it would not provide the desired connection to the graphical lasso estimator. That is why, in this paper, as outlined in Section 2, we redefine the ambiguity set to obtain the desired closed form expressed in Theorem 2.1 in a more transparent way.

B Proofs for the paper
B.1 Proof of Theorem 2.1
Proof.
Consider K ∈ S_d^{++}. Observe that the function l(·; K) : S_d → R is Borel measurable since it is a continuous function. Then, we use the duality result for the DRO formulation from Proposition C.2 in the appendix of this paper and obtain

    sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] = inf_{γ ≥ 0} { γδ + (1/n) ∑_{i=1}^n sup_{W ∈ S_d} { l(W; K) − γ c(W, W_i) } }.    (24)

Let ∆ := W − W_i. Then,

    sup_{W ∈ S_d} { l(W; K) − γ c(W, W_i) }
    = sup_{W ∈ S_d} { trace(K W) − log|K| − γ ‖vec(W) − vec(W_i)‖_q^ρ }
    = sup_{∆ ∈ S_d} { trace(K(∆ + W_i)) − γ ‖vec(∆)‖_q^ρ } − log|K|
    = sup_{∆ ∈ S_d} { trace(K∆) − γ ‖vec(∆)‖_q^ρ } + trace(K W_i) − log|K|
    = sup_{∆ ∈ M(K)} { ‖vec(∆)‖_q ‖vec(K)‖_p − γ ‖vec(∆)‖_q^ρ } + trace(K W_i) − log|K|,    (25)

with M(K) = { ∆ ∈ S_d | trace(K∆) > 0, |∆_{ij}|^q = θ |k_{ij}|^p for some θ > 0 }, so that the fourth line follows from selecting a ∆ ∈ S_d (since K ∈ S_d^{++}) such that Hölder's inequality holds tightly (with 1/p + 1/q = 1). In fact, Hölder's inequality holds tightly if and only if ∆ ∈ M(K) (Steele, 2004, Chapter 9), even for the limiting case q = ∞, p = 1. Observe that there exist multiple ∆ ∈ S_d that satisfy Hölder's inequality tightly. As a consequence, we are still free to choose the magnitude of the q-norm of such vec(∆) (and this is what we will use next).

Now, the argument inside the supremum in the last line of (25) is a polynomial function of ‖vec(∆)‖_q. We have to analyze two cases.

Case 1: ρ = 1. In this case we observe that, setting ε(γ, K) = sup_{∆ ∈ M(K)} { ‖vec(∆)‖_q (‖vec(K)‖_p − γ) }:
• if γ ≥ ‖vec(K)‖_p, then ε(γ, K) = 0 (in particular, if γ > ‖vec(K)‖_p, the optimizer is ∆ = 0_{d×d});
• if γ < ‖vec(K)‖_p, then ε(γ, K) = ∞;

so that, recalling (24) and that the outside infimum is taken over γ ≥ 0, we must have

    sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] = inf_{γ ≥ ‖vec(K)‖_p} { γδ + (1/n) ∑_{i=1}^n ( trace(K W_i) − log|K| ) },

from which we immediately obtain (8).

Case 2: ρ > 1. By differentiation and basic calculus (e.g., using the first and second derivative tests) we obtain that the maximizer

    ∆* = argsup_{∆ ∈ M(K)} { ‖vec(∆)‖_q ‖vec(K)‖_p − γ ‖vec(∆)‖_q^ρ }

is such that ‖vec(∆*)‖_q = ( ‖vec(K)‖_p / (γρ) )^{1/(ρ−1)}. Then,

    sup_{W ∈ S_d} { l(W; K) − γ c(W, W_i) } = ‖vec(K)‖_p^{ρ/(ρ−1)} (γρ)^{−1/(ρ−1)} − γ ( ‖vec(K)‖_p / (γρ) )^{ρ/(ρ−1)} + trace(K W_i) − log|K|
    = ‖vec(K)‖_p^{ρ/(ρ−1)} (ρ − 1) ρ^{−ρ/(ρ−1)} γ^{−1/(ρ−1)} + trace(K W_i) − log|K|.    (26)

Replacing this back in (24),

    sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] = inf_{γ ≥ 0} { γδ + ‖vec(K)‖_p^{ρ/(ρ−1)} (ρ − 1) ρ^{−ρ/(ρ−1)} γ^{−1/(ρ−1)} } + trace(K A_n) − log|K|.    (27)

Now, we observe that the argument inside the infimum in the last line of (27) grows to infinity both as γ → 0 and as γ → ∞, so that the minimum is attained at some optimal γ. By using the first and second derivative tests, we obtain that the minimizer is γ* = ‖vec(K)‖_p / (ρ δ^{(ρ−1)/ρ}). Then, replacing this back in (27), and then this in (24), we finally obtain (8) after some algebraic simplification.

B.2 Proof of Lemma 3.1
Proof.
Consider K ∈ S_d^{++} and define g(K) = sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] for a fixed δ. We prove (12) by a direct application of Proposition 8 of (Blanchet et al., 2016, Appendix C), observing that we satisfy the three conditions for its application:
(i) g is convex on S_d^{++} and finite,
(ii) there exists b ∈ R such that the sublevel set κ_b = {K | g(K) ≤ b} is compact and non-empty,
(iii) E_P[l(W; K)] is lower semi-continuous and convex as a function of K throughout κ_b for any P ∈ {P | D_c(P, P_n) ≤ δ}.

First, we observe that E_P[l(W; K)] = trace(K E_P[W]) − log|K| < ∞, since E_Q(‖X‖²) < ∞. Then, using Theorem 2.1, the function

    g(K) = sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] = trace(K A_n) − log|K| + δ^{1/ρ} ‖vec(K)‖_p

is finite. Moreover, we also claim it is convex for all K ∈ S_d^{++}. This follows from the fact that trace(K A_n) − log|K| and ‖vec(K)‖_p, p ∈ [1, ∞], are two convex functions of K ∈ S_d^{++}, and from the fact that a nonnegative weighted sum of two convex functions is another convex function (Boyd and Vandenberghe, 2004). This proves (i).

Now, we claim that g(K) is radially unbounded, i.e., g(K) → ∞ as ‖vec(K)‖_p → ∞. To see this, recall that trace(K A_n) − log|K| is a differentiable convex function of K that is minimized whenever K^{-1} = A_n, since A_n is invertible for n > d almost surely. Then,

    g(K) = trace(K A_n) − log|K| + δ^{1/ρ} ‖vec(K)‖_p ≥ d − log|A_n^{-1}| + δ^{1/ρ} ‖vec(K)‖_p,

from which it immediately follows that g(K) → ∞ as ‖vec(K)‖_p → ∞. Again, since g(K) is also convex and continuous on S_d^{++}, we conclude that the sublevel sets κ_b = {K | g(K) ≤ b} are compact and nonempty for any b large enough that κ_b contains some K ∈ S_d^{++}. This proves (ii).

Moreover, since l(W; K) is convex and continuous in K ∈ S_d^{++}, it follows that E_P[l(W; K)] for any P ∈ {P | D_c(P, P_n) ≤ δ} is also continuous and convex in K ∈ S_d^{++}; thus (iii) follows immediately.

B.3 Proof of Theorem 3.2

Proof.
Consider K ∈ S_d^{++}. Setting h(U; K) = U − K^{-1}, it is clear that we satisfy the conditions for applying Proposition C.1 in the Appendix, and so we obtain

    R_n(K) = sup_{Λ ∈ S_d} { −(1/n) ∑_{i=1}^n sup_{U ∈ S_d} { trace(Λ^⊤(U − K^{-1})) − ‖vec(U) − vec(W_i)‖_q^ρ } }.    (28)

Now, letting ∆ := U − W_i,

    sup_{U ∈ S_d} { trace(Λ^⊤(U − K^{-1})) − ‖vec(U) − vec(W_i)‖_q^ρ }
    = sup_{∆ ∈ S_d} { trace(Λ^⊤(∆ + W_i − K^{-1})) − ‖vec(∆)‖_q^ρ }
    = sup_{∆ ∈ S_d} { trace(Λ^⊤∆) − ‖vec(∆)‖_q^ρ } + trace(Λ^⊤(W_i − K^{-1}))
    = sup_{∆ ∈ M(Λ)} { ‖vec(Λ)‖_p ‖vec(∆)‖_q − ‖vec(∆)‖_q^ρ } + trace(Λ^⊤(W_i − K^{-1})),    (29)

with M(Λ) as in the proof of Theorem 2.1, so that the third line follows from selecting a ∆ ∈ S_d such that Hölder's inequality holds tightly (with 1/p + 1/q = 1), whose existence has been explained in the proof of Theorem 2.1. Thus, we are still free to choose the magnitude of the q-norm of such vec(∆) (and this is what we will use next).

Now, the argument inside the supremum in the last line of (29) is a polynomial function of ‖vec(∆)‖_q. We have to analyze two cases.

Case 1: ρ = 1. In this case we observe that, setting ε(Λ) = sup_{∆ ∈ S_d} { ‖vec(∆)‖_q (‖vec(Λ)‖_p − 1) }:
• if ‖vec(Λ)‖_p ≤ 1, then ε(Λ) = 0 (in particular, if ‖vec(Λ)‖_p < 1, the optimizer is ∆ = 0_{d×d});
• if ‖vec(Λ)‖_p > 1, then ε(Λ) = ∞;

so that, recalling (28), we see that if ‖vec(Λ)‖_p > 1, then the term inside the supremum over Λ equals −∞. Then, we obtain that

    R_n(K) = sup_{Λ ∈ S_d : ‖vec(Λ)‖_p ≤ 1} { −(1/n) ∑_{i=1}^n trace(Λ^⊤(W_i − K^{-1})) }
    = sup_{Λ ∈ S_d : ‖vec(Λ)‖_p ≤ 1} { −trace(Λ^⊤(A_n − K^{-1})) }
    = sup_{Λ ∈ S_d : ‖vec(Λ)‖_p ≤ 1} { vec(Λ)^⊤ vec(A_n − K^{-1}) }
    = ‖vec(A_n − K^{-1})‖_q,

where the third line results from the fact that Λ is a free variable so we can flip its sign, and the last line follows from the analysis of conjugate norms and the fact that Λ, A_n − K^{-1} ∈ S_d. We thus obtained (16).

Case 2: ρ > 1. By differentiation and basic calculus (e.g., using the first and second derivative tests) we obtain that the maximizer

    ∆* = argsup_{∆ ∈ S_d} { ‖vec(∆)‖_q ‖vec(Λ)‖_p − ‖vec(∆)‖_q^ρ }

is such that ‖vec(∆*)‖_q = ( ‖vec(Λ)‖_p / ρ )^{1/(ρ−1)}. Then, replacing this back in (28),

    R_n(K) = sup_{Λ ∈ S_d} { −(1/n) ∑_{i=1}^n ( ‖vec(Λ)‖_p^{ρ/(ρ−1)} (ρ − 1) ρ^{−ρ/(ρ−1)} + trace(Λ^⊤(W_i − K^{-1})) ) }
    = sup_{Λ ∈ S_d} { trace(Λ^⊤(A_n − K^{-1})) − ‖vec(Λ)‖_p^{ρ/(ρ−1)} (ρ − 1) ρ^{−ρ/(ρ−1)} }
    = sup_{Λ ∈ S_d} { ‖vec(Λ)‖_p ‖vec(A_n − K^{-1})‖_q − ‖vec(Λ)‖_p^{ρ/(ρ−1)} (ρ − 1) ρ^{−ρ/(ρ−1)} },

where the second line again uses the fact that the sign of Λ is free. Again, by differentiation and basic calculus, we obtain that the maximizer Λ* is such that ‖vec(Λ*)‖_p = ρ ‖vec(A_n − K^{-1})‖_q^{ρ−1}. Replacing this value back in our previous expression, we get that R_n(K) = ‖vec(A_n − K^{-1})‖_q^ρ, and thus we showed (16).

C Applicability of the dual representations of the RWP function and the DRO formulation
The dual representations of the RWP function and the DRO formulation for the case in which the space of probability measures is P(R^d × R^d) are studied in (Blanchet et al., 2016). In this paper, we are interested in the case P(S_d × S_d); in other words, we consider the samples to be in S_d instead of R^d. We want to emphasize that the derivations of these dual representations rely on the dual formulation of the so-called "problem of moments", or a specific class of "Chebyshev-type inequalities", referenced in the work by Isii (1962). The derivation by Isii is actually more general, in the sense that it applies to more general probability spaces than the ones used in this paper and in (Blanchet et al., 2016) (in fact, it is stated for general spaces of non-negative measures).

Throughout this section, we consider an integrable function h : S_d × S_d → S_d, and a lower semi-continuous function c : S_d × S_d → [0, ∞) such that c(U, U) = 0 for any U ∈ S_d and such that the set Ω := {(U', W') ∈ S_d × S_d | c(U', W') < ∞} is Borel measurable and non-empty. Also consider an iid random sample W_1, ..., W_n ∼ W, with W coming from a distribution in P(S_d).

Now, let us focus first on the RWP function in the following proposition, which parallels (Blanchet et al., 2016, Proposition 3).

Proposition C.1. Consider K ∈ S_d^{++}. Let h(·, K) be Borel measurable. Also, suppose that 0_{d×d} lies in the interior of the convex hull of {h(U', K) | U' ∈ S_d}. Then,

    R_n(K) = sup_{Λ ∈ S_d} { −(1/n) ∑_{i=1}^n sup_{U ∈ S_d} { trace(Λ^⊤ h(U; K)) − c(U, W_i) } }.

Proof. Consider the proof of (Blanchet et al., 2016, Proposition 3). If we:
• set the estimating equation to be E[h(W; K)] = 0_{d×d},
• set R_n(K) = inf { E_π[c(U, W)] | E_π[h(U; K)] = 0_{d×d}, π_W = P_n, π ∈ P(S_d × S_d) }, with π_W denoting the marginal distribution of W,
• consider the previously defined set Ω,
then we obtain, following the rest of that proof (and using (Isii, 1962, Theorem 1) with its special case):

    R_n(K) = sup_{Λ ∈ S_d} { −(1/n) ∑_{i=1}^n sup_{U ∈ S_d} { vec(Λ)^⊤ vec(h(U; K)) − c(U, W_i) } }
    = sup_{Λ ∈ S_d} { −(1/n) ∑_{i=1}^n sup_{U ∈ S_d} { trace(Λ^⊤ h(U; K)) − c(U, W_i) } },

thus obtaining the dual representation of the RWP function.

The following proposition for the dual representation of the DRO formulation parallels (Blanchet et al., 2016, Proposition 1).

Proposition C.2. For γ ≥ 0 and loss functions l(U'; K) that are upper semi-continuous in U' ∈ S_d for each K ∈ S_d^{++}, let

    φ_γ(W_i; K) = sup_{U ∈ S_d} { l(U; K) − γ c(U, W_i) }.    (30)

Then

    sup_{P : D_c(P, P_n) ≤ δ} E_P[l(W; K)] = min_{γ ≥ 0} { γδ + (1/n) ∑_{i=1}^n φ_γ(W_i; K) }.

Proof. The proof of the dual representation of the DRO formulation for our domain of symmetric matrices is very similar to the one described in Proposition 4 of (Blanchet et al., 2016, version 2), following changes analogous to those made in the proof of Proposition C.1.
D Figures, tables and additional analysis for the numerical results (Section 4 of the paper)
D.1 Matthews correlation coefficient analysis
Let true positives (TP) be the number of nonzero off-diagonal entries of Ω that are correctly identified, false negatives (FN) be the number of its nonzero off-diagonal entries that are incorrectly identified as zeros, false positives (FP) be the number of its zero off-diagonal entries that are incorrectly identified as nonzeros, and true negatives (TN) be the number of its zero off-diagonal entries that are correctly identified. Given the estimated precision matrix K̂, the Matthews correlation coefficient (MCC) (Powers, 2011) is defined as

    MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),    (31)

and, whenever the denominator is zero, it can be arbitrarily set to one. It can be shown that MCC ∈ [−1, +1], where +1 is interpreted as a perfect prediction (of both zero and nonzero values), 0 as a prediction no better than a random one, and −1 as total disagreement between prediction and observation.
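For concreteness, these counts and the MCC in (31) can be computed from the off-diagonal supports as in the following sketch (hypothetical helper, assuming NumPy; exact zeros are detected up to a small tolerance):

```python
import numpy as np

def support_metrics(Omega, K_hat, tol=1e-8):
    """TPR, FDR and MCC (31) for the off-diagonal support of K_hat versus Omega."""
    off_diag = ~np.eye(Omega.shape[0], dtype=bool)
    truth = np.abs(Omega[off_diag]) > tol
    pred = np.abs(K_hat[off_diag]) > tol
    tp = int(np.sum(truth & pred))
    tn = int(np.sum(~truth & ~pred))
    fp = int(np.sum(~truth & pred))
    fn = int(np.sum(truth & ~pred))
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    fdr = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    denom = np.sqrt(float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / (denom if denom > 0 else 1.0)  # denominator set to 1 if zero
    return tpr, fdr, mcc
```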
D.2 Regularization parameters

All the plots related to the RS criterion for choosing λ have on their x-axis the values α ∈ {0.10, 0.21, 0.33, 0.44, 0.56, 0.67, 0.79, 0.90}. We study the cases for sample sizes n ∈ {75, 200, 1000}.

Figure 3: n = 75. Panels: (a) RS, (b) CV.
Figure 4: n = 200. Panels: (a) RS, (b) CV.
Figure 5: n = 1000. Panels: (a) RS, (b) CV.

D.3 Performance figures for different choices of the regularization parameter

All the plots related to the RS and RWP criteria for choosing λ have on their x-axis the values α ∈ {0.10, 0.21, 0.33, 0.44, 0.56, 0.67, 0.79, 0.90}. We study the cases for sample sizes n ∈ {75, 200, 1000}.

D.3.1 n = 75

Figure 6: True positive rate (%). Panels: (a) RS, (b) CV, (c) RWP.
Figure 7: False detection rate (%). Panels: (a) RS, (b) CV, (c) RWP.

D.3.2 n = 200

Figure 8: True positive rate (%). Panels: (a) RS, (b) CV, (c) RWP.
Figure 9: False detection rate (%). Panels: (a) RS, (b) CV, (c) RWP.

D.3.3 n = 1000

Figure 10: True positive rate (%). Panels: (a) RS, (b) CV, (c) RWP.