An Efficient Gibbs Sampling Algorithm for Bayesian Graphical LASSO Models with the Positive Definite Constraint on the Precision Matrix*

Sakae Oya† (Graduate School of Economics, Keio University)
Teruo Nakatsuma (Faculty of Economics, Keio University)
Abstract
Wang (2012) proposed a novel Gibbs sampling algorithm for Bayesian analysis of Gaussian graphical LASSO models. In this paper, we propose a modification to Wang (2012)'s algorithm so that the precision matrix in a graphical LASSO model never fails to be positive definite in any cycle of the Gibbs sampling procedure. Our proposed algorithm is designed to sample the off-diagonal elements of the precision matrix exactly from the region where the precision matrix remains positive definite. As a result, it is more stable in the sense that the sampling procedure will not halt due to numerical exceptions related to the lack of positive definiteness. The simulation results show that our proposed algorithm significantly improves the performance of parameter estimation and graphical structure learning.
Keywords:
Hit-and-Run algorithm, Gibbs sampler, Positive definiteness, Precision matrix

* This research is supported by JSPS Grants-in-Aid for Scientific Research Grant Number 19K01592 and the Keio Economic Society.
[email protected]

1 Introduction
Suppose Y is an (n × p) data matrix of p variables and n observations, and the t-th row vector of Y, y_t (1 ≤ t ≤ n), follows a multivariate normal distribution N(0, Ω^{-1}), where Ω = (ω_ij) (1 ≤ i, j ≤ p) is the inverse of the covariance matrix, called the precision matrix. In the multivariate normal distribution, ω_ij = 0 implies that y_ti and y_tj are conditionally independent given the other variables. Therefore the set of non-zero off-diagonal elements in Ω constitutes an undirected graphical structure among (y_t1, ..., y_tp), which is called the Gaussian graphical model. We may estimate Ω by maximizing the log likelihood:

    ℓ(Ω) = −(np/2) log 2π + (n/2) log|Ω| − (1/2) tr(SΩ),    (1)

where S = (s_ij) = Y'Y. In practice, however, the MLE with (1) does not produce estimates of off-diagonal ω_ij's that are exactly equal to zero. To obtain "zero estimates" of ω_ij's, we may employ a LASSO-type penalized MLE:

    max_{Ω ∈ M+}  (n/2) log|Ω| − (1/2) tr(SΩ) − λ‖Ω‖₁,    (2)

where ‖Ω‖₁ = Σ_{i≤j} |ω_ij| and M+ is the subset of the parameter space of Ω in which Ω is a positive definite precision matrix. The solution of (2) is called the graphical LASSO estimator, and this model has been studied extensively in recent years, including Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Banerjee et al. (2008), Friedman et al. (2008) and Guo et al. (2011), among others.
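For reference, the penalized problem (2) with a common shrinkage parameter can be computed with off-the-shelf software. Below is a minimal sketch using scikit-learn's GraphicalLasso (a convenient choice we assume here, not part of this paper's method); implementations differ in how they scale the penalty and whether they penalize the diagonal, so alpha corresponds to λ only up to such conventions.

    # A minimal sketch (not from the paper): the graphical LASSO point
    # estimator in (2) computed with scikit-learn.
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    Y = rng.standard_normal((200, 10))        # n = 200 observations, p = 10 variables

    model = GraphicalLasso(alpha=0.1).fit(Y)  # alpha plays the role of lambda
    Omega_hat = model.precision_              # sparse estimate of the precision matrix
    print((np.abs(Omega_hat) > 1e-8).sum())   # number of non-zero entries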
Note that the penalty in (2) is equivalent to the logarithm of

    p(ω_ij) = { λ e^{−λ ω_ii},        (i = j);
              { (λ/2) e^{−λ |ω_ij|},  (i ≠ j).    (3)

From the viewpoint of Bayesian statistics, as in Marlin et al. (2009) and Marlin and Murphy (2009), the graphical LASSO estimator is a MAP (maximum a posteriori) estimator of Ω in which the prior distribution of each diagonal element is exponential and that of each off-diagonal element is Laplace, as in (3). This is a natural extension of the original Bayesian LASSO by Park and Casella (2008), who extended the LASSO regression by Tibshirani (1996) to a Bayesian counterpart.

Based on this interpretation, Wang (2012) and Khondker et al. (2013) independently proposed Markov chain sampling algorithms to generate the precision matrix Ω from its posterior distribution. Wang (2012) developed a Gibbs sampling algorithm, while Khondker et al. (2013) devised a Metropolis-Hastings algorithm that can generate a positive definite precision matrix. In this paper, we explore Wang (2012)'s approach since it is a pure Gibbs sampler and does not suffer from the problem of a low acceptance rate even if the dimension of Ω is high.

Let us preview Wang (2012)'s algorithm (block Gibbs sampler) briefly, though we will discuss it in detail later in Section 2. Wang (2012)'s block Gibbs sampler generates the i-th diagonal element ω_ii and the off-diagonal elements in the i-th column (or row) alternately in the following fashion.

Block Gibbs sampler for the precision matrix
  For i = 1, ..., p, repeat Step 1 to Step 3.
  Step 1: Partition Ω into the i-th diagonal element ω_ii, the off-diagonal elements (ω_1i, ..., ω_{i−1,i}, ω_{i+1,i}, ..., ω_pi) and the rest.
  Step 2: Generate ω_ii from its full conditional posterior distribution.
  Step 3: Generate (ω_1i, ..., ω_{i−1,i}, ω_{i+1,i}, ..., ω_pi) from their full conditional posterior distribution.

These full conditional posterior distributions will be derived in Section 2. Since Wang (2012)'s block Gibbs sampler enables us to generate Ω from the posterior distribution so easily, it has become an indispensable building block for recent applied research in Bayesian analysis of Gaussian graphical models. For example, Li et al. (2019) proposed a Gibbs sampling algorithm for Bayesian graphical horseshoe models, which is a natural extension of Wang (2012)'s block Gibbs sampler for the Bayesian graphical LASSO.

Although this block Gibbs sampler is nice and elegant, the precision matrix Ω generated with this sampling algorithm is not necessarily positive definite, because Step 3 does not generate the off-diagonal elements of Ω from M+. To demonstrate our point, we ran Monte Carlo experiments similar to the ones conducted by Wang (2012). We generated data sets with six different graph structures (AR(1), AR(2), Block, Star, Circle and Full) and two different dimensions (p = 30, 100), in which the shrinkage parameter λ may differ from element to element in Ω. We will go into details about the design of each Monte Carlo experiment in Section 4. The number of iterations in the block Gibbs sampler is 10,000 for each experiment. Thus, if we count every Ω that is partially updated in Step 1 to Step 3 as a distinctive one, we have 300,000 (p = 30) or 1,000,000 (p = 100) replications of Ω in one experiment. The results of the Monte Carlo experiments are summarized in Table 1.

Table 1: The number of violations of the positive definiteness of Ω

  p    AR(1)    AR(2)    Block    Star     Circle    Full
  30   7,644    561      12       27       77,768    14
       (2.55)   (0.19)   (0.00)   (0.01)   (25.88)   (0.00)
  100  566      9        0        2,524    205,093   0
       (0.06)   (0.00)   (0.00)   (0.25)   (20.51)   (0.00)

  Notes: (a) the number of generated Ω's is p × 10,000; (b) figures in parentheses are percentages of the total.

In the case of p = 30, the violation of positive definiteness occurs in all designs. In particular, about one quarter of the generated Ω's do not satisfy the positive definiteness in the Circle design. In the case of p = 100, the violation of positive definiteness is less severe for some designs, but the ratio of violations is still high (20.51%) in the Circle design.

To address this issue, we propose to improve Wang (2012)'s block Gibbs sampler so that the generated Ω never fails to be positive definite. Although it may seem intractable to guarantee the positive definiteness of Ω in each cycle of the block Gibbs sampler, it turns out that the Hit-and-Run algorithm by Bélisle et al. (1993) is applicable to the Bayesian (adaptive) graphical LASSO in a fairly straightforward manner, and the resultant algorithm is a pure Gibbs sampler without any Metropolis-Hastings step. Therefore our proposed algorithm enjoys the same efficiency as Wang (2012)'s algorithm, yet it prevents Ω from violating positive definiteness.
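The violation counts in Table 1 can be tallied with a simple positive definiteness check after each partial update of Ω. Below is a minimal sketch of such a check, assuming Ω is stored as a NumPy array; the helper name is ours, not taken from the paper's code. It relies on the fact that a Cholesky factorization of a symmetric matrix succeeds if and only if the matrix is positive definite.

    # A minimal sketch (illustrative, not from the paper's supplementary code):
    # np.linalg.cholesky succeeds if and only if the symmetric matrix is
    # positive definite, so a LinAlgError signals a violation.
    import numpy as np

    def is_positive_definite(Omega: np.ndarray) -> bool:
        """Return True if the symmetric matrix Omega is positive definite."""
        try:
            np.linalg.cholesky(Omega)
            return True
        except np.linalg.LinAlgError:
            return False

    # Example: a symmetric matrix that is not positive definite
    # (its determinant is 2.0 * 1.0 - 1.9**2 < 0).
    Omega = np.array([[2.0, 1.9],
                      [1.9, 1.0]])
    print(is_positive_definite(Omega))  # False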
The main body of this paper is organized as follows. In Section 2, we will briefly review Wang (2012)'s Gibbs sampling algorithm for the Bayesian adaptive graphical LASSO, though Wang (2012) also derived the algorithm for the Bayesian graphical LASSO with the common shrinkage parameter. (We explain the Bayesian adaptive LASSO in Section 2; in our experience, the violation of positive definiteness occurs whether it is adaptive or not.) In Section 3, we will derive the full conditional posterior distributions of the elements of the precision matrix under the positive definiteness constraint and propose our modified sampler. In Section 4, we will report the results of Monte Carlo experiments. Section 5 concludes.

2 Wang (2012)'s Block Gibbs Sampler

In this section, we briefly review the Gibbs sampling algorithm developed by Wang (2012). Although Wang (2012) derived it for the Bayesian graphical LASSO with the prior distribution (3) as well, we consider a more general prior setting that allows λ in (3) to vary for each element of the precision matrix Ω, namely,

    p(ω_ij) = { λ_ii e^{−λ_ii ω_ii},         (i = j);
              { (λ_ij/2) e^{−λ_ij |ω_ij|},   (i ≠ j),    (4)

which is called the adaptive graphical LASSO. (Wang (2012) assumes that the prior distribution of each diagonal element ω_ii is (λ_ii/2) exp(−(λ_ii/2) ω_ii) instead of λ_ii exp(−λ_ii ω_ii). This is because Wang (2012) employs ‖Ω‖₁ = Σ_{i=1}^p Σ_{j=1}^p |ω_ij| as the penalty, in which each off-diagonal element ω_ij (i ≠ j) appears twice, whereas ours is ‖Ω‖₁ = Σ_{i=1}^p Σ_{j=1}^i |ω_ij|, which includes the lower triangular part of Ω only.) Since Wang (2012) demonstrated that the Bayesian adaptive LASSO outperformed the non-adaptive counterpart in terms of parameter estimation and graphical structure learning, we illustrate the Gibbs sampling algorithm for the Bayesian adaptive graphical LASSO in detail.

To derive the Gibbs sampling algorithm, Wang (2012) utilizes the well-known fact that the Laplace distribution in (4) can be expressed as a scale mixture of normal distributions with an exponential mixing distribution:

    ω_ij | τ_ij ~ N(0, τ_ij),   τ_ij ~ Exp(λ_ij²/2).    (5)

By using the gamma distribution Ga(r, s) as the common prior for λ_ij (1 ≤ i ≤ j ≤ p), we obtain the joint posterior distribution of ω = {ω_ij}_{i≤j}, τ = {τ_ij}_{i<j} and λ = {λ_ij}_{i≤j}:

    p(ω, τ, λ | Y) ∝ |Ω|^{n/2} exp[−(1/2) tr(SΩ)] × ∏_{i=1}^p λ_ii e^{−λ_ii ω_ii}
                     × ∏_{i<j} (2π τ_ij)^{−1/2} exp(−ω_ij²/(2τ_ij)) (λ_ij²/2) e^{−λ_ij² τ_ij/2}
                     × ∏_{i≤j} λ_ij^{r−1} e^{−s λ_ij}.    (6)

The full conditional posterior distributions of λ_ij and τ_ij are standard:

    λ_ij | ω, Y ~ Ga(r + 1, s + |ω_ij|),    (7)
    1/τ_ij | ω, λ, Y ~ IG(λ_ij/|ω_ij|, λ_ij²),    (8)

where IG(µ′, λ′) denotes the inverse Gaussian distribution with mean µ′ and shape λ′. The non-trivial part is the update of the elements of Ω itself. Partition Ω as

    Ω = [ Ω_11  ω_12 ; ω_12'  ω_22 ],    (9)

where Ω_11 is a ((p − 1) × (p − 1)) matrix, ω_12 is a ((p − 1) × 1) vector and ω_22 is a scalar. Without loss of generality, we can rearrange the rows and columns of Ω so that the lower-right corner of Ω, ω_22, is the diagonal element to be generated from its full conditional posterior distribution. Likewise, we partition S, Υ and λ as

    S = [ S_11  s_12 ; s_12'  s_22 ],   Υ = [ Υ_11  τ_12 ; τ_12'  0 ],   λ = [ Λ_11  λ_12 ; λ_12'  λ_22 ],    (10)

where Υ is the (p × p) symmetric matrix whose off-diagonal (i, j) element is τ_ij and whose diagonal elements are all equal to zero, while λ_22 is the element in λ that corresponds with the diagonal element ω_22 in the prior distribution (4).

With the partition of Ω in (9) and that of S in (10), we have

    tr(SΩ) = s_22 ω_22 + 2 s_12' ω_12 + tr(S_11 Ω_11),   and   |Ω| = (ω_22 − ω_12' Ω_11^{-1} ω_12) |Ω_11|.

Then the likelihood can be expressed as

    p(Y | Ω) ∝ |Ω|^{n/2} exp[−(1/2) tr(SΩ)]
             ∝ (ω_22 − ω_12' Ω_11^{-1} ω_12)^{n/2} |Ω_11|^{n/2}
               × exp[−(1/2){s_22 ω_22 + 2 s_12' ω_12 + tr(S_11 Ω_11)}].    (11)

By applying a change of variables,

    (ω_12, ω_22) → (β = ω_12, γ = ω_22 − ω_12' Ω_11^{-1} ω_12),    (12)

to the likelihood (11), we have

    p(Y | Ω) ∝ γ^{n/2} exp[−(1/2){s_22 γ + s_22 β' Ω_11^{-1} β + 2 s_12' β + tr(S_11 Ω_11)}]
             ∝ γ^{n/2} exp[−(1/2){s_22 γ + s_22 β' Ω_11^{-1} β + 2 s_12' β}].    (13)

With the adaptive prior (4) and the flat prior p(γ) ∝ constant, Wang (2012) proposes to use

    γ | θ_{−γ}, Y ~ Ga(n/2 + 1, (s_22 + 2λ_22)/2),    (14)
    β | θ_{−β}, Y ~ N(−C s_12, C),   C = {(s_22 + 2λ_22) Ω_11^{-1} + D_τ^{-1}}^{-1},   D_τ = diag(τ_12),    (15)

as the full conditional posterior distributions of γ and β, where θ_{−γ} (resp. θ_{−β}) stands for all the remaining parameters other than γ (resp. β).
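The following is a minimal sketch of one draw from (14) and (15) for a single column, assuming the partition (9)-(10) has already been formed; the function name and argument layout are illustrative, not taken from the paper's supplementary code. Note that numpy parameterizes the gamma distribution by a scale rather than a rate, and that nothing in this unconstrained update enforces positive definiteness, which is the issue taken up in Section 3.

    # A minimal sketch (illustrative names) of Wang (2012)'s draws (14)-(15)
    # for one column of Omega, given the partition (9)-(10).
    import numpy as np

    def wang_column_update(Omega11_inv, s12, s22, lam22, tau12, n, rng):
        """One unconstrained update of (omega_12, omega_22) as in (14)-(15)."""
        # (14): gamma ~ Ga(n/2 + 1, rate=(s22 + 2*lam22)/2); numpy uses scale = 1/rate.
        gamma = rng.gamma(shape=n / 2 + 1, scale=2.0 / (s22 + 2.0 * lam22))
        # (15): beta ~ N(-C s12, C), C = {(s22 + 2*lam22) Omega11^{-1} + D_tau^{-1}}^{-1}
        C = np.linalg.inv((s22 + 2.0 * lam22) * Omega11_inv + np.diag(1.0 / tau12))
        beta = rng.multivariate_normal(-C @ s12, C)
        omega12 = beta                                      # new off-diagonal column
        omega22 = gamma + omega12 @ Omega11_inv @ omega12   # back out omega_22 via (12)
        return omega12, omega22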
In summary, Wang (2012)'s block Gibbs sampler is given as follows.

Block Gibbs sampler for all parameters
  For i = 1, ..., p, repeat Step 1 to Step 5.
  Step 1: Rearrange Ω, S, Υ and λ so that ω_ii is in the place of ω_22 in Ω, and partition them as in (9) and (10).
  Step 2: Generate γ ← Ga(n/2 + 1, (s_22 + 2λ_22)/2) and set ω_22 = γ + ω_12' Ω_11^{-1} ω_12.
  Step 3: Generate β ← N(−C s_12, C) and set ω_12 = β.
  Step 4: Generate λ_ij ← Ga(r + 1, s + |ω_ij|).
  Step 5: Generate υ_ij ← IG(λ_ij/|ω_ij|, λ_ij²) and set τ_ij = 1/υ_ij for the off-diagonal elements.

3 Modified Block Gibbs Sampler

As we pointed out in the introduction, Wang (2012)'s block Gibbs sampler does not necessarily guarantee the positive definiteness of the generated Ω's. In this section, we directly derive the full conditional posterior distributions of ω_12 and ω_22 without the transformation (12), and propose an efficient sampling method to generate them under the positive definiteness constraint Ω ∈ M+.

First, let us derive the full conditional posterior distribution of ω_22. Given that Ω from the previous iteration of the block Gibbs sampler is positive definite, the newly generated ω_12 and ω_22 must satisfy

    ω_22 > ω_12' Ω_11^{-1} ω_12    (16)

to ensure that the updated Ω is also positive definite. In other words, the conditional prior distribution of ω_22 given ω_12 and Ω_11 must be

    p(ω_22 | ω_12, Ω_11) ∝ λ_22 exp(−λ_22 ω_22) 1_{M+_22}(ω_22),    (17)

where M+_22 = {ω_22 : ω_22 > ω_12' Ω_11^{-1} ω_12}. Therefore, by ignoring the parts that do not depend on ω_22 in (11), we have

    p(ω_22 | θ_{−ω_22}, Y) ∝ (ω_22 − ω_12' Ω_11^{-1} ω_12)^{n/2} exp(−(s_22/2) ω_22) exp(−λ_22 ω_22) 1_{M+_22}(ω_22)
                           ∝ (ω_22 − ω_12' Ω_11^{-1} ω_12)^{n/2} exp[−((s_22 + 2λ_22)/2)(ω_22 − ω_12' Ω_11^{-1} ω_12)] 1_{M+_22}(ω_22).    (18)

The full conditional posterior distribution of ω_22 in (18) is a shifted gamma distribution:

    ω_22 = u + ω_12' Ω_11^{-1} ω_12,   u ~ Ga(n/2 + 1, (s_22 + 2λ_22)/2).    (19)

Obviously, the distribution of u in (19) is equivalent to that of γ in (14). Thus (19) and (14) are basically identical to each other, and ω_22 generated from either (19) or (14) always satisfies the positive definiteness condition (16).

Next, let us derive the full conditional posterior distribution of ω_12. For the same reason as in (17), the conditional prior distribution of ω_12 must be the following truncated multivariate normal distribution:

    p(ω_12 | ω_22, Ω_11) ∝ exp(−(1/2) ω_12' D_τ^{-1} ω_12) 1_{M+_12}(ω_12),    (20)

where M+_12 = {ω_12 : ω_22 > ω_12' Ω_11^{-1} ω_12}. As a result, the full conditional posterior distribution of ω_12 is also a truncated multivariate normal distribution:

    ω_12 | θ_{−ω_12}, Y ~ N(−C s_12, C) 1_{M+_12}(ω_12).    (21)

On the other hand, Wang (2012) proposes to use the unconstrained multivariate normal distribution (15), which does not impose the truncation 1_{M+_12}(ω_12), to generate ω_12 (= β). Consequently, if we generate ω_12 from (15), there is no guarantee that the newly updated ω_12 satisfies the positive definiteness condition (16). This is the reason why the generated Ω's are not always positive definite, as shown in Table 1. Therefore, in order to ensure the positive definiteness of Ω, it is preferable to use the truncated multivariate normal distribution (21) in the block Gibbs sampler.

Since either a naive rejection method or a Metropolis-Hastings algorithm is inefficient even for a modest-size graphical model, we apply the Hit-and-Run algorithm (Bélisle et al. (1993)) to generate ω_12 from the truncated multivariate normal distribution (21).

Hit-and-Run algorithm
  Step 1: Pick a point α on the unit sphere randomly as α = z/‖z‖, z ~ N(0, I).
  Step 2: Generate a random scalar κ from the distribution with the density

    f(κ) ∝ p(ω_12 + κα) 1_{M+_12}(ω_12 + κα),    (22)

  where p(·) is the density of N(−C s_12, C) in (21).
  Step 3: Set ω_12 + κα as the new ω_12.
It is straightforward to show that the distribution of κ in (22) is

    κ ~ N(µ_κ, σ_κ²) 1_{M+_12}(ω_12 + κα),    (23)

where

    µ_κ = −(s_12' α + ω_12' C^{-1} α)/(α' C^{-1} α),   σ_κ² = 1/(α' C^{-1} α).

The indicator function 1_{M+_12}(ω_12 + κα) is equal to one if and only if

    (ω_12 + κα)' Ω_11^{-1} (ω_12 + κα) − ω_22 < 0.

This means that κ must satisfy

    aκ² + 2bκ + c < 0,   a = α' Ω_11^{-1} α,   b = ω_12' Ω_11^{-1} α,   c = ω_12' Ω_11^{-1} ω_12 − ω_22.

Note that a > 0 and c < 0 because Ω is positive definite, so that b² − ac > 0 and the quadratic equation aκ² + 2bκ + c = 0 has two distinct real roots. Therefore the distribution in (23) is a truncated univariate normal distribution on the interval

    R+ = {κ : (−b − √(b² − ac))/a < κ < (−b + √(b² − ac))/a}.

Hence, by using the Hit-and-Run algorithm, sampling from the seemingly intractable distribution (21) is reduced to sampling from the truncated univariate normal distribution

    κ ~ N(µ_κ, σ_κ²) 1_{R+}(κ),

and the sampling procedure becomes far simpler.

By replacing (14) in Step 2 with (19) and (15) in Step 3 with the Hit-and-Run algorithm, we have the modified block Gibbs sampler as follows.

Modified block Gibbs sampler
  For i = 1, ..., p, repeat Step 1 to Step 5.
  Step 1: Rearrange Ω, S, Υ and λ so that ω_ii is in the place of ω_22 in Ω, and partition them as in (9) and (10).
  Step 2: Generate u ← Ga(n/2 + 1, (s_22 + 2λ_22)/2) and set ω_22 = u + ω_12' Ω_11^{-1} ω_12.
  Step 3: (a) Generate z ← N(0, I) and set α = z/‖z‖.
          (b) Generate κ ← N(µ_κ, σ_κ²) 1_{R+}(κ) and update the old ω_12 with ω_12 + κα.
  Step 4: Generate λ_ij ← Ga(r + 1, s + |ω_ij|).
  Step 5: Generate υ_ij ← IG(λ_ij/|ω_ij|, λ_ij²) and set τ_ij = 1/υ_ij for the off-diagonal elements.
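The following is a minimal sketch of Step 3, with illustrative names; scipy.stats.truncnorm is one convenient (assumed) choice for the truncated univariate normal draw. Here C_inv denotes C^{-1} = (s_22 + 2λ_22) Ω_11^{-1} + D_τ^{-1}, which is available without inverting C.

    # A minimal sketch (illustrative, not the paper's supplementary code) of
    # one Hit-and-Run update of omega_12 inside the positive definite region.
    import numpy as np
    from scipy.stats import truncnorm

    def hit_and_run_update(omega12, omega22, Omega11_inv, C_inv, s12, rng):
        """Step 3 of the modified sampler: move omega_12 along a random direction."""
        z = rng.standard_normal(omega12.shape)
        alpha = z / np.linalg.norm(z)                  # Step 3(a): random direction
        # Untruncated normal parameters for kappa as in (23)
        denom = alpha @ C_inv @ alpha
        mu = -(s12 @ alpha + omega12 @ C_inv @ alpha) / denom
        sigma = np.sqrt(1.0 / denom)
        # Coefficients of a*kappa^2 + 2*b*kappa + c < 0, which defines R+
        a = alpha @ Omega11_inv @ alpha
        b = omega12 @ Omega11_inv @ alpha
        c = omega12 @ Omega11_inv @ omega12 - omega22
        root = np.sqrt(b * b - a * c)                  # real since a > 0 and c < 0
        lo, hi = (-b - root) / a, (-b + root) / a
        # Draw kappa from N(mu, sigma^2) truncated to (lo, hi);
        # truncnorm takes standardized bounds.
        kappa = truncnorm.rvs((lo - mu) / sigma, (hi - mu) / sigma,
                              loc=mu, scale=sigma, random_state=rng)
        return omega12 + kappa * alpha                 # Step 3(b): the new omega_12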
4 Monte Carlo Experiments

In this section, we report the results of Monte Carlo experiments in order to compare our modified block Gibbs sampler with Wang (2012)'s original algorithm in terms of accuracy in parameter estimation and graphical structure learning. For brevity, we shall refer to Wang (2012)'s original algorithm as BGS (block Gibbs sampler) and our modified version as HRS (Hit-and-Run sampler). Following Wang (2012), we examine the following six different specifications of the Gaussian graphical model in the Monte Carlo experiments:

(a) AR(1): σ_ij = 0.7^{|i−j|}.
(b) AR(2): ω_ii = 1, ω_{i,i−1} = ω_{i−1,i} = 0.5 and ω_{i,i−2} = ω_{i−2,i} = 0.25.
(c) Block: σ_ii = 1, σ_ij = 0.5 for 1 ≤ i ≠ j ≤ p/2, σ_ij = 0.5 for p/2 + 1 ≤ i ≠ j ≤ p, and σ_ij = 0 otherwise.
(d) Star: ω_ii = 1, ω_{1,i} = ω_{i,1} = 0.1, and ω_ij = 0 otherwise.
(e) Circle: ω_ii = 2, ω_{i−1,i} = ω_{i,i−1} = 1, and ω_{1p} = ω_{p1} = 0.9.
(f) Full: ω_ii = 2 and ω_ij = 1 for i ≠ j,

where σ_ij (1 ≤ i, j ≤ p) is the (i, j) element of the covariance matrix Ω^{-1} in the Gaussian graphical model.

Other settings for the Monte Carlo experiments also mirror Wang (2012). For each model, we generate a sample of (p × 1) random vectors y_1, ..., y_n independently from N(0, Ω^{-1}). We consider two cases, (n, p) = (50, 30) and (n, p) = (200, 100), so that we have 12 (= 6 × 2) scenarios in the experiments. The hyperparameters in the prior distribution of λ_ij are r = 10^{-2} and s = 10^{-6}. For both BGS and HRS, the number of burn-in iterations is 5,000 and the Monte Carlo sample from the following 10,000 iterations is used in Bayesian inference. (The same design of simulation, i.e., the specifications of Ω, the combinations of (n, p), the hyperparameters, the burn-in iterations and the size of the Monte Carlo sample, was used in producing the results in Table 1.) We repeat each scenario of the simulation 50 times and obtain a set of point estimates of Ω. All computations were implemented on a workstation with 64GB RAM and a six-core 3.4GHz Intel Xeon processor using Python 3.6.1. HRS requires additional computations because it explicitly imposes the positive definite constraint Ω ∈ M+, but we observed only a modest difference in computation time between HRS and BGS.

In order to compare HRS with BGS in terms of accuracy in point estimation of the precision matrix Ω, we compute two sample loss functions, Stein's loss and the Frobenius norm, as measurements of the discrepancy between the point estimate and the true Ω. Table 2 shows the sample median loss (Stein's loss in the upper half and the Frobenius norm in the lower half) of the 50 replications in the 12 scenarios for BGS and HRS. Figures in parentheses are the standard errors.

Table 2: Sample Median Loss in Point Estimation of Ω

                 AR(1)   AR(2)   Block   Star    Circle  Full
  Stein's loss
  p = 30   BGS   1.88    4.48    1.38    1.52    1.81    19.31
                 (0.32)  (0.49)  (0.28)  (0.26)  (0.32)  (0.87)
           HRS   …       …       …       …       …       …
                 (0.20)  (0.18)  (0.18)  (0.20)  (0.16)  (0.52)
  p = 100  BGS   3.02    4.25    2.81    3.75    3.08    69.65
                 (0.20)  (0.26)  (0.18)  (0.22)  (0.18)  (1.10)
           HRS   …       …       …       …       …       …
                 (0.08)  (0.08)  (0.07)  (0.08)  (0.05)  (0.72)
  Frobenius norm
  p = 30   BGS   4.04    3.01    2.19    2.19    2.51    29.61
                 (0.55)  (0.18)  (0.35)  (0.30)  (0.42)  (0.06)
           HRS   …       …       …       …       …       …
                 (0.27)  (0.13)  (0.22)  (0.22)  (0.07)  (0.49)
  p = 100  BGS   4.38    2.33    2.90    3.20    2.59    99.61
                 (0.29)  (0.12)  (0.13)  (0.13)  (0.25)  (0.02)
           HRS   …       …       …       …       …       …
                 (0.11)  (0.05)  (0.08)  (0.07)  (0.03)  (0.41)

  Notes: (a) the smaller loss is boldfaced; (b) figures in parentheses are the standard errors.

The loss is unanimously and substantially smaller for HRS than for BGS. This observation is valid not only for the Circle model, in which the positive definiteness of Ω is most frequently violated as shown in Table 1, but also for the other models with different graphical structures. Interestingly, HRS outperforms BGS even for the Full model, in which Ω is not sparse and the estimation loss of the graphical LASSO model is expected to be much worse. Furthermore, this tendency is unchanged in both the small-size (p = 30) and the large-size (p = 100) models. All in all, the results in Table 2 suggest that imposing the positive definiteness constraint remarkably improves the accuracy in point estimation of Ω in the Bayesian adaptive graphical LASSO.

To assess the performance of graphical structure learning, we check whether the point estimate of Ω can successfully restore the true structure from the simulated data. Recall that there is no connection between nodes, say node i and node j (1 ≤ i, j ≤ p), if ω_ij = 0. As in Fan et al. (2009), we use the following rule to determine whether a pair of nodes is connected or not:

    |ˆω_ij| ≥ 10^{-3}   (node i and node j are connected);
    |ˆω_ij| < 10^{-3}   (node i and node j are not connected),    (24)

where ˆω_ij is the point estimate of ω_ij computed with the Monte Carlo sample of Ω that we generate for each scenario with HRS or BGS. Then, with the estimated graphical structures (we have 50 of them), accuracy in graphical structure learning is measured with three criteria: specificity, sensitivity and the Matthews correlation coefficient (MCC), namely

    Specificity = TN/(TN + FP),   Sensitivity = TP/(TP + FN),
    MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),    (25)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives in the 50 replications, respectively.
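Below is a minimal sketch (ours, not the paper's code) of how rule (24) and the criteria in (25) can be computed for a single point estimate against the true precision matrix; in the experiments, the TP/TN/FP/FN counts are accumulated over the 50 replications before the ratios are formed.

    # A minimal sketch: apply rule (24) to a point estimate and compute the
    # three criteria in (25) over the off-diagonal elements.
    import numpy as np

    def structure_metrics(Omega_hat, Omega_true, tol=1e-3):
        """Return (specificity, sensitivity, MCC) of the estimated graph."""
        off = ~np.eye(Omega_true.shape[0], dtype=bool)   # off-diagonal mask
        est = np.abs(Omega_hat[off]) >= tol              # (24): detected edges
        true = Omega_true[off] != 0                      # true edges
        tp = float(np.sum(est & true))
        tn = float(np.sum(~est & ~true))
        fp = float(np.sum(est & ~true))
        fn = float(np.sum(~est & true))
        specificity = tn / (tn + fp)
        sensitivity = tp / (tp + fn)
        mcc = (tp * tn - fp * fn) / np.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return specificity, sensitivity, mcc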
Table 3 reports the calculated criteria for the 12 scenarios.

Table 3: Accuracy in Graphical Structure Learning

                 AR(1)   AR(2)   Block   Star    Circle
  Specificity
  p = 30   BGS   6.00    10.22   7.10    6.77    12.34
           HRS   …       …       …       …       …
  p = 100  BGS   10.69   20.39   12.93   12.15   28.45
           HRS   …       …       …       …       …
  Sensitivity
  p = 30   BGS   100.00  99.31   100.00  …       …
           HRS   …       …       …       …       …
  p = 100  BGS   100.00  100.00  100.00  100.00  100.00
           HRS   100.00  100.00  100.00  100.00  100.00
  MCC
  p = 30   BGS   7.85    12.39   5.19    3.45    11.74
           HRS   …       …       …       …       …
  p = 100  BGS   5.96    11.18   3.89    6.41    10.86
           HRS   …       …       …       …       …

  Notes: (a) the better result is boldfaced; (b) figures are in percentage.

In Table 3, HRS outperforms BGS for all scenarios except for the sensitivity of the Star model with p = 30, though the sensitivity of HRS is still more than 90%. Especially in the case of p = 100, the values of specificity are more than 90% for HRS, which means that most of the zero off-diagonal elements in Ω are correctly identified. This accuracy is quite crucial when we try to detect the true graphical structure in practice. It seems that imposing the positive definiteness constraint also enhances graphical structure learning in the Bayesian adaptive graphical LASSO.

(The results of BGS in Table 3 are far different from those in Wang (2012, Table 2). We guess that this discrepancy is caused by the difference in the criteria for detecting connections. Wang (2012, p. 882) states "we claim {ω_ij = 0} if ˆω_ij < 10^{-3} as Fan et al. (2009)", which means that a negative ˆω_ij, whether it is near or far away from zero, is regarded as evidence against a connection between nodes. As a result, negative relations between nodes would be over-rejected and the estimated graphical structure would be too sparse, in the sense that the precision matrix includes too many zeros in the off-diagonal elements. To confirm this conjecture, we recalculated the three criteria in (25) without the absolute value in (24) and found that the recalculated results were comparably similar to those in Wang (2012).)

5 Conclusion

In this paper, we proposed a modification to Wang (2012)'s famous Gibbs sampling algorithm for the Bayesian graphical LASSO. Our modified algorithm guarantees the positive definiteness of the precision matrix throughout the loop of Gibbs sampling by generating the off-diagonal elements of the precision matrix from a truncated multivariate normal distribution whose support is the region where the updated precision matrix remains positive definite. To facilitate sampling from such a complicated distribution, we proposed to utilize the Hit-and-Run algorithm by Bélisle et al. (1993). The derived algorithm is still a pure Gibbs sampler and keeps the efficiency and scalability of Wang (2012)'s original algorithm. In the simulation study, we showed that our modified algorithm remarkably improves accuracy in point estimation as well as in graphical structure learning. Since the part of the Gibbs sampling algorithm in which the precision matrix is updated is common to other graphical shrinkage models such as the graphical horseshoe model, it is straightforward to incorporate our modified algorithm into the Gibbs sampling algorithm for the Bayesian graphical horseshoe model by Li et al. (2019) or other related sampling procedures for Gaussian graphical models with scale-mixture-of-normals shrinkage priors.
Supplementary Material

Python code for the block Gibbs sampler and the Hit-and-Run sampler used in this paper is available from https://github.com/sakaesaserunrun/onglasso .

References

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485-516, 2008.

C. J. P. Bélisle, H. E. Romeijn, and R. L. Smith. Hit-and-Run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255-266, 1993.

J. Fan, Y. Feng, and Y. Wu. Network exploration via the adaptive lasso and SCAD penalties. Annals of Applied Statistics, 3(2):521-541, 2009.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint estimation of multiple graphical models. Biometrika, 98(1):1-15, 2011.

Z. Khondker, H. Zhu, W. Lin, and J. Ibrahim. The Bayesian covariance lasso. Statistics and Its Interface, 6(2):243-259, 2013.

Y. Li, B. A. Craig, and A. Bhadra. The graphical horseshoe estimator for inverse covariance matrices. Journal of Computational and Graphical Statistics, 28(3):747-757, 2019. doi: 10.1080/10618600.2019.1575744.

B. M. Marlin and K. P. Murphy. Sparse Gaussian graphical models with unknown block structure. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 705-712, New York, NY, USA, 2009. Association for Computing Machinery.

B. M. Marlin, M. Schmidt, and K. P. Murphy. Group sparse priors for covariance estimation. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09), pages 383-392, Arlington, Virginia, USA, 2009. AUAI Press.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.

T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103:681-686, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

H. Wang. Bayesian graphical lasso methods and efficient posterior computation. Bayesian Analysis, 7(4):867-886, 2012.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.