Scaling It Up: Stochastic Search Structure Learning in Graphical Models
Bayesian Analysis (2015), Number 2, pp. 351–377
Hao Wang∗

∗Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan 48824, U.S.A., [email protected]

Abstract.
Gaussian concentration graph models and covariance graph models are two classes of graphical models that are useful for uncovering latent dependence structures among multivariate variables. In the Bayesian literature, graphs are often determined through the use of priors over the space of positive definite matrices with fixed zeros, but these methods present daunting computational burdens in large problems. Motivated by the superior computational efficiency of continuous shrinkage priors for regression analysis, we propose a new framework for structure learning that is based on continuous spike and slab priors and uses latent variables to identify graphs. We discuss model specification, computation, and inference for both concentration and covariance graph models. The new approach produces reliable estimates of graphs and efficiently handles problems with hundreds of variables.
Keywords:
Bayesian inference, Bi-directed graph, Block Gibbs, Concentration graph models, Covariance graph models, Credit default swap, Undirected graph, Structural learning.
Graphical models use graph structures for modeling and making statistical inferences regarding complex relationships among many variables. Two types of commonly used graphs are undirected graphs, which represent conditional dependence relationships among variables, and bi-directed graphs, which encode marginal dependence among variables. Structure learning refers to the problem of estimating unknown graphs from the data and is usually carried out by sparsely estimating the covariance matrix of the variables under the assumption that the data follow a multivariate Gaussian distribution. Under the Gaussian assumption, undirected graphs are determined by zeros in the concentration matrix, and their structure learning problems are thus referred to as concentration graph models; bi-directed graphs are determined by zeros in the covariance matrix, and their structure learning problems are thus referred to as covariance graph models. This work concerns structure learning in both concentration and covariance graph models.

Classical methods for inducing sparsity often rely on penalized likelihood approaches (Banerjee et al., 2008; Yuan and Lin, 2007; Bien and Tibshirani, 2011; Wang, 2014). Model fitting then uses deterministic optimization procedures such as coordinate descent. Thresholding is another popular method for the sparse estimation of covariance matrices for covariance graph models (Bickel and Levina, 2008; Rothman et al., 2009); however, there is no guarantee that the resulting estimator is always positive definite.
Bayesian methods for imposing sparsity require the specification of priors over the space of positive definite matrices constrained by fixed zeros. Under such priors, model determination is then carried out through stochastic search algorithms that explore a discrete graphical model space. The inherent probabilistic nature of the Bayesian framework permits estimation via decision-theoretic principles, addresses parameter and model uncertainty, and provides a global characterization of the parameter space. It also encourages the development of modular structures that can be integrated into more complex systems.

A major challenge in Bayesian graphical models is their computation. Although substantial progress in computation for these two graphical models has been made in recent years, scalability with dimension remains a significant issue, hindering the ability to adapt these models to the growing demand for higher dimensional problems. Recently published statistical papers on these two graphical models either focus on small problems or report long computing times. In concentration graph models, Dobra et al. (2011) report that it takes approximately 1.5 days to fit a problem of 48 nodes on a dual-core 2.8 GHz computer under C; Wang and Li (2012) report approximately two days for a 100 node problem under MATLAB; and Cheng and Lenkoski (2012) report a computing time of 1–20 seconds per one-edge update for a 150 node problem using a 400 core server with 3.2 GHz CPUs under R and C++. In covariance graph models, Silva and Ghahramani (2009) fit problems of up to 13 nodes and conclude that "further improvements are necessary for larger problems."

To scale up with dimension, this paper develops a new approach called stochastic search structure learning (SSSL) to efficiently determine covariance and concentration graph models. The central idea behind SSSL is the use of continuous shrinkage priors characterized by binary latent indicators in order to avoid the normalizing constant approximation and to allow block updates of graphs. The use of continuous shrinkage priors contrasts with the point-mass priors at zero that are used by essentially all existing methods for Bayesian structure learning in these two graphical models.

The motivation for SSSL comes from the successful development of continuous shrinkage priors in several related problems. In regression analysis, continuous shrinkage priors were used in the seminal paper by George and McCulloch (1993) in the form of a two-component normal mixture for selecting important predictors, and these priors have recently garnered substantial research attention as a computationally attractive alternative for regularizing many regression coefficients (e.g., Park and Casella 2008; Griffin and Brown 2010; Armagan et al. 2013). In the estimation of covariance matrices, they are used for regularizing concentration elements and have been shown to provide fast and accurate estimates of covariance matrices (Wang, 2012; Khondker et al., 2013). In factor analysis, they are used instead of point-mass priors (Carvalho et al., 2008) for modeling factor loading matrices, efficiently handling hundreds of variables (Bhattacharya and Dunson, 2011).

Nevertheless, the current work is fundamentally different from the aforementioned works. The research focus here is the structure learning of graphs, which is distinct from regression analysis, factor analysis, and the pure covariance estimation problem.
Assume that y = (y_1, y_2, ..., y_p)′ is a p-dimensional random vector following a multivariate normal distribution N(0, Σ) with mean zero and covariance matrix Σ ≡ (σ_ij). Let Ω ≡ (ω_ij) = Σ^{-1} be the concentration matrix. Covariance and concentration graph models are immediately related to Σ and Ω, respectively. Let Y be the n × p data matrix consisting of n independent samples of y and let S = Y′Y. The theory and existing methods for structure learning are briefly reviewed in the next two sections.

Concentration graph models (Dempster, 1972) consider the concentration matrix Ω and encode conditional dependence using an undirected graph G = (V, E), where V = {1, 2, ..., p} is a non-empty set of vertices and E ⊆ {(i, j) : i < j} is a set of edges representing unordered pairs of vertices. The graph G can also be indexed by a set of p(p − 1)/2 binary variables Z = (z_ij)_{i<j}, with z_ij = 1 if and only if (i, j) ∈ E. The standard Bayesian treatment places a G-Wishart prior on Ω given the graph,

    p(Ω | Z) = {I_GW(b, D, Z)}^{-1} |Ω|^{(b−2)/2} exp{−(1/2) tr(DΩ)} 1{Ω ∈ M⁺(Z)},   (1)

where b is the degrees of freedom, D is the location parameter, I_GW(b, D, Z) is the normalizing constant, and M⁺(Z) is the space of positive definite matrices with zero entries wherever z_ij = 0, together with independent Bernoulli priors on the edge-inclusion indicators,

    p(Z) = ∏_{i<j} π^{z_ij} (1 − π)^{1−z_ij},   (2)

where a common choice is π = 2/(p − 1) (Jones et al., 2005). Under (1)–(2), some methods (e.g., Jones et al. 2005; Scott and Carvalho 2008; Lenkoski and Dobra 2011) learn Z directly through its posterior distribution over the model space p(Z | Y) ∝ p(Y | Z) p(Z). Other methods learn Z indirectly through sampling over the joint space of graphs and concentration matrices p(Ω, Z | Y) (Giudici and Green, 1999; Dobra et al., 2011; Wang and Li, 2012). Regardless of the type of algorithm, two shared features of these methods cause the framework (1)–(2) to be inefficient for larger p problems. The first of these features is that graphs are updated in a one-edge-at-a-time manner, meaning that sweeping through all possible edges requires a loop of O(p²) iterations. The second feature is that the normalizing constant I_GW(b, D, Z) for non-decomposable graphs requires approximation. The commonly used Monte Carlo approximation proposed by Atay-Kayis and Massam (2005) is not only unstable in some situations but also requires a costly matrix completion step for each of the M Monte Carlo samples, making these methods unacceptably slow in large graphs. Recent works by Wang and Li (2012) and Cheng and Lenkoski (2012) propose the use of exchange algorithms to avoid the Monte Carlo approximation. However, the computational burden remains daunting; empirical experiments in these papers suggest it would take several days to complete the fitting for problems of p ≈ 100 on a desktop computer.

In the classical formulation, concentration graphs are induced by imposing a graphical lasso penalty on Ω in order to encourage zeros in the penalized maximum likelihood estimates of Ω (e.g., Yuan and Lin 2007; Friedman et al. 2008; Rothman et al. 2008). In particular, the standard graphical lasso problem is to maximize the penalized log-likelihood

    log(det Ω) − tr(S_n Ω) − ρ ||Ω||_1,

over the space of positive definite matrices M⁺, with ρ ≥ 0 the penalty parameter, S_n the sample covariance matrix, and ||Ω||_1 = Σ_{1≤i,j≤p} |ω_ij| the L1-norm of Ω. The graphical lasso problem has a Bayesian interpretation (Wang, 2012). Its estimator is equivalent to the maximum a posteriori estimate under the following prior for Ω:

    p(Ω) = C^{-1} ∏_{1≤i,j≤p} exp(−λ |ω_ij|) 1{Ω ∈ M⁺},   (3)

where C is the normalizing constant. By exploiting the scale mixture of normal representation, Wang (2012) shows that fitting (3) is very efficient using block Gibbs samplers for up to the lower hundreds of variables.
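The equivalence can be verified directly. Assuming the Gaussian likelihood p(Y | Ω) ∝ |Ω|^{n/2} exp{−(1/2) tr(SΩ)} and writing S_n = S/n, the log-posterior under (3) is, up to additive constants,

    log p(Ω | Y) = (n/2) log(det Ω) − (1/2) tr(SΩ) − λ ||Ω||_1
                 = (n/2) { log(det Ω) − tr(S_n Ω) − (2λ/n) ||Ω||_1 },

so the posterior mode solves the graphical lasso problem with ρ = 2λ/n.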
A comparison between the two Bayesian methods (1)–(2) and (3) helps to explain the intuition behind the proposed SSSL. Model (1)–(2) explicitly treats a graph Z as an unknown parameter and considers its posterior distribution, which leads to straightforward Bayesian inferences about graphs. However, it is slow to run due to the one-edge-at-a-time updating and the normalizing constant approximation. In contrast, Model (3) uses continuous priors, enabling a fast block Gibbs sampler that updates Ω one column at a time and avoids normalizing constant evaluation. However, no graphs are used in the formulation of (3).

Covariance graph models (Cox and Wermuth, 1993) consider the covariance matrix Σ and encode marginal dependence using a bi-directed graph G = (V, E), where each edge has bi-directed arrows instead of the full line used by an undirected graph. Similar to concentration graph models, the covariance graph G can also be indexed by binary variables Z = (z_ij)_{i<j}. The Bayesian treatment places a G-inverse Wishart prior on Σ given the graph,

    p(Σ | Z) = {I_GIW(b, D, Z)}^{-1} |Σ|^{−(b+2p)/2} exp{−(1/2) tr(DΣ^{-1})} 1{Σ ∈ M⁺(Z)},   (4)

where b specifies the degrees of freedom, D is the location parameter, and I_GIW(b, D, Z) is the normalizing constant. Structure learning is then carried out through the marginal likelihood function p(Y | Z) = (2π)^{−np/2} I_GIW(b + n, D + S, Z) / I_GIW(b, D, Z). Unfortunately, the key quantity of the normalizing constant I_GIW(b, D, Z) is analytically intractable, even for decomposable graphs. Silva and Ghahramani (2009) propose a Monte Carlo approximation via an importance sampling algorithm, which becomes computationally infeasible for p beyond a few dozen. Their empirical experiments are thus limited to small problems.

In the classical literature, one approach is thresholding, which estimates σ_ij to be zero if its sample estimate is below a threshold (Bickel and Levina, 2008; Rothman et al., 2009; Cai and Liu, 2011). Another approach is motivated by the lasso-type procedures. Bien and Tibshirani (2011) propose a covariance graphical lasso procedure for simultaneously estimating the covariance matrix and marginal dependence structures. Their method is to minimize the following objective function:

    log(det Σ) + tr(S_n Σ^{-1}) + ρ ||Σ||_1,   (5)

over the space of positive definite matrices M⁺, with ρ ≥ 0 and ||Σ||_1 = Σ_{1≤i,j≤p} |σ_ij| the L1-norm of Σ. Although a Bayesian version of (5) has not been explored previously, its derivation is straightforward through the prior

    p(Σ) = C^{-1} ∏_{1≤i,j≤p} exp(−λ |σ_ij|) 1{Σ ∈ M⁺}.   (6)

In light of the excellent performance of the Bayesian concentration graphical lasso (3) reported in Wang (2012), we hypothesize that (6) shares similar performances. In fact, we have developed a block Gibbs sampler for (6) and found that it gives a shrinkage estimation of Σ and is computationally efficient, although it provides no explicit treatment of the graph Z. The detailed algorithm and results are not reported in this paper but are available upon request. Comparing (4) and (6) suggests that, again, the different strengths of (4) and (6) might be combined to provide a better approach for structure learning in covariance graph models.

Let A = (a_ij)_{p×p} denote a p-dimensional covariance or concentration matrix; that is, A = Σ or Ω. SSSL uses the following new prior for A:

    p(A) = {C(θ)}^{-1} ∏_{i<j} {(1 − π) N(a_ij | 0, v_0) + π N(a_ij | 0, v_1)} ∏_i Exp(a_ii | λ/2) 1{A ∈ M⁺},   (7)

where N(a | 0, v) denotes the normal density with mean zero and variance v, Exp(a | λ/2) denotes the exponential density with rate λ/2, θ = (v_0, v_1, π, λ) collects the hyperparameters with 0 < v_0 < v_1 and π ∈ (0, 1), and C(θ) is the normalizing constant. The prior (7) is a two-component mixture of normals, a continuous spike and slab prior, on the off-diagonal elements, constrained to the positive definite space. It can be represented hierarchically through latent binary indicators Z = (z_ij)_{i<j}:

    p(A | Z) = {C(Z, v_0, v_1, λ)}^{-1} ∏_{i<j} N(a_ij | 0, v_{z_ij}) ∏_i Exp(a_ii | λ/2) 1{A ∈ M⁺},   (8)

    p(Z) = {C(θ)}^{-1} C(Z, v_0, v_1, λ) ∏_{i<j} π^{z_ij} (1 − π)^{1−z_ij},   (9)

where v_{z_ij} = v_0 if z_ij = 0 and v_{z_ij} = v_1 if z_ij = 1, and C(Z, v_0, v_1, λ) is the normalizing constant of (8). Note that 0 < C(Z, v_0, v_1, λ) < 1 because it is equal to the integration of the product of normal densities and exponential densities over a constrained space M⁺. Thus, (8) and (9) are proper distributions. The joint distribution of (A, Z) acts to cancel out the two terms of C(Z, v_0, v_1, λ) and results in a marginal distribution of A, as in (7).
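Ignoring the positive definite constraint for illustration, the mixture in (7) and the implied conditional edge-inclusion probability Pr(z_ij = 1 | a_ij) = π N(a_ij | 0, v_1) / {π N(a_ij | 0, v_1) + (1 − π) N(a_ij | 0, v_0)} are simple to compute. The following Python sketch is ours, not the paper's MATLAB routines, and the function names are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def mixture_density(a, v0, v1, pi):
    """Two-component normal mixture density for an off-diagonal element a_ij."""
    return (1 - pi) * norm.pdf(a, scale=np.sqrt(v0)) + pi * norm.pdf(a, scale=np.sqrt(v1))

def edge_inclusion_prob(a, v0, v1, pi):
    """Conditional Pr(z_ij = 1 | a_ij); the C(Z, v0, v1, lambda) terms cancel in the joint."""
    w1 = pi * norm.pdf(a, scale=np.sqrt(v1))
    w0 = (1 - pi) * norm.pdf(a, scale=np.sqrt(v0))
    return w1 / (w0 + w1)

# Example: p = 150, common choice pi = 2/(p - 1), spike v0 = 0.05, slab v1 = h * v0, h = 50.
p, v0, h = 150, 0.05, 50
pi = 2 / (p - 1)
print(edge_inclusion_prob(0.3, v0, h * v0, pi))
```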
The rationale behind using Z for structure learning is as follows. For an appropriately chosen small value of v_0, the event z_ij = 0 means that a_ij comes from the concentrated component N(0, v_0), and so a_ij is likely to be close to zero and can reasonably be estimated as zero. For an appropriately chosen large value of v_1, the event z_ij = 1 means that a_ij comes from the diffuse component N(0, v_1), and so a_ij can be estimated to be substantially different from zero. Because zeros in A determine missing edges in graphs, the latent binary variables Z can be viewed as edge-inclusion indicators. Given data Y, the posterior distribution of Z provides information about graphical model structures. The remaining questions are then how to specify the parameters θ and how to perform posterior computations.

From (9), the hyperparameter π controls the prior distribution of the edge-inclusion indicators in Z. The choice of π should thus reflect the prior belief about what the graphs will be in the final model. In practice, such prior information is often summarized via the marginal prior edge-inclusion probability Pr(z_ij = 1). Specifically, a prior for Z is chosen such that the implied edge-inclusion probability of edge (i, j) meets the prior belief about the chance of the existence of edge (i, j). For example, the common choice Pr(z_ij = 1) = 2/(p − 1) reflects the prior belief that the expected number of edges is approximately {p(p − 1)/2} Pr(z_ij = 1) = p. Another important reason that Pr(z_ij = 1) is used for calibrating priors over Z is that the posterior inference about Z is usually based upon the marginal posterior probability Pr(z_ij = 1 | Y). For example, the median probability graph, the graph consisting of those edges whose marginal edge-inclusion probability exceeds 0.5, is often used to estimate G (Wang, 2010). Focusing on the marginal edge-inclusion probability allows us to understand how the posterior truly responds to the data.
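As a quick check of this calibration (ignoring, for the moment, the positive definiteness effect discussed below), the prior expected number of edges is

    E(number of edges) = Σ_{i<j} Pr(z_ij = 1) = {p(p − 1)/2} × {2/(p − 1)} = p,

for example, 150 expected edges out of p(p − 1)/2 = 11175 possible edges when p = 150.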
Calibrating π according to Pr(z_ij = 1) requires knowledge of the relation between Pr(z_ij = 1) and π. From (9), the explicit form of Pr(z_ij = 1) as a function of π is unavailable because of the intractable term C(Z, v_0, v_1, λ). A comparison between (9) and (2) helps illustrate the issue. Removing C(Z, v_0, v_1, λ) from (9) turns it into (2) but will not cancel out the term C(Z, v_0, v_1, λ) in (8) for the posterior distribution of Z. Tedious and unstable numerical integration is then necessary to evaluate C(Z, v_0, v_1, λ) at each iteration of sampling Z. Inserting C(Z, v_0, v_1, λ) into (9) cancels C(Z, v_0, v_1, λ) in (8) in the posterior, thus facilitating computation, yet concerns might be raised about such a "fortunate" cancelation. For example, Murray (2007) notes that a prior that cancels out an intractable normalizing constant in the likelihood would depend on the number of data points and would also be so extreme that it would dominate posterior inferences. These two concerns appear not to be problematic in our case. The prior (9) is for the hyperparameter Z, rather than for the parameter directly involved in the likelihood; thus it does not depend on sample size. The prior also only introduces mild bias without dominating the inferences, as shown below.

To investigate whether the cancelation of C(Z, v_0, v_1, λ) causes the prior to be too extreme, we compute Pr(z_ij = 1) numerically from Monte Carlo samples generated by the algorithm described below. In (8)–(9), we first fix π = 2/(p − 1), v_0 = 0.05, and λ = 1, and then vary the dimension p over a grid of values and v_1 = h v_0 over several values of h. Panel (a) of Figure 1 displays the estimated Pr(z_ij = 1) as a function of p for different h values. As a reference, the curve Pr(z_ij = 1) = 2/(p − 1) is also plotted. All values of Pr(z_ij = 1) from (9) are below the reference curve, suggesting that there is a downward bias introduced by C(Z, v_0, v_1, λ) on Pr(z_ij = 1). The bias arises because the positive definite constraint on A favors the small variance v_0, specified by z_ij = 0, over the large variance v_1, specified by z_ij = 1. We also see that the bias is larger at larger values of h, at which the impact of the positive definite constraint is more significant. Next, we fix p = 150, h = 50, and λ = 1 and vary v_0 ∈ {0.02, 0.05, 0.1} and π over several multiples of 1/(p − 1). Panel (b) of Figure 1 displays the implied Pr(z_ij = 1) as a function of π for different v_0 values. Again, as a reference, the curve Pr(z_ij = 1) = π is plotted. The downward bias of Pr(z_ij = 1) relative to π continues to exist and is larger at larger values of v_0 or π, because the positive definite constraint on A forces z_ij = 0 to be chosen more often when v_0 or π is large.

Figure 1: The implied prior marginal edge-inclusion probability Pr(z_ij = 1) from the prior (8)–(9) as a function of p at different h (left) and as a function of π at different v_0 (right), together with two reference curves of Pr(z_ij = 1) = 2/(p − 1) (left) and of Pr(z_ij = 1) = π (right).

Nevertheless, the downward bias seems to be gentle, as Pr(z_ij = 1) is never extremely small; consequently, the prior (8)–(9) is able to let the data speak about Z if the likelihood is strong.

Another concern is that the lack of an analytical relation between Pr(z_ij = 1) and π might raise challenges against the incorporation of prior information about Pr(z_ij = 1) into π. This problem can be side-stepped by prior simulation and interpolation. Take a p = 150 node problem as an example. If the popular choice Pr(z_ij = 1) = 2/(p − 1) = 0.013 is desirable, interpolating the points (π, Pr(z_ij = 1)) in Panel (b) of Figure 1 suggests that π should be set above the nominal 0.013, up to approximately 0.048 for v_0 = 0.1, with smaller adjustments for the smaller values of v_0. Our view is that obtaining these points under the prior (8)–(9) is much faster than evaluating C(Z, v_0, v_1, λ) at each configuration of Z under the traditional prior (2).

From (8), the choice of v_0 should be such that, if the data support z_ij = 0 over z_ij = 1, then a_ij is small enough to be replaced by zero. The choice of v_1 should be such that, if the data favor z_ij = 1 against z_ij = 0, then a_ij can be accurately estimated to be substantially different from zero. One general strategy for choosing v_0 and v_1, as recommended by George and McCulloch (1993), is based on the concept of practical significance. Specifically, suppose a small δ can be chosen such that |a_ij| < δ might be regarded as practically insignificantly different from zero. Incorporating such a prior belief is then achieved by choosing v_0 and v_1 such that the density p(a_ij | z_ij = 0) is larger than the density p(a_ij | z_ij = 1) precisely within the interval (−δ, δ). An explicit expression of v_0 as a function of δ and h can be derived when p(a_ij | z_ij) are normal. However, the implied distribution p(a_ij | z_ij) from (8)–(9) is neither normal nor even analytically tractable.
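For reference, in the idealized case where both components are exactly normal, the crossing point has a closed form: equating the densities of N(0, v_0) and N(0, v_1) with v_1 = h v_0 at ±δ gives

    (2π v_0)^{−1/2} exp{−δ²/(2 v_0)} = (2π h v_0)^{−1/2} exp{−δ²/(2 h v_0)}  ⟺  v_0 = δ²(h − 1)/(h log h).

For instance, δ = 0.5 and h = 50 would suggest v_0 = 0.25 × 49/(50 log 50) ≈ 0.063. Because the constraint to M⁺ distorts these densities, such values serve only as rough starting points under (8)–(9).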
A numerical study illustrates some aspects of p(a_ij | z_ij). Figure 2 draws the Monte Carlo estimated densities of p(a_ij | z_ij = 0) and p(a_ij | z_ij = 1) for several settings of v_0 and h. In all cases, there is a clear separation between p(a_ij | z_ij = 0) and p(a_ij | z_ij = 1), with a larger h resulting in a sharper separation. This separation between a small and a large variance component is the essence of the prior for structure learning, which aims to separate small and large a_ij's. Clearly, the marginal densities are no longer normal. For example, the density of p(a_ij | z_ij = 0) is more spiky than that of N(0, v_0), and the difference between p(a_ij | z_ij = 1) when h = 50 and h = 100 is less clear than the difference between the corresponding unconstrained normal densities.

Figure 2: Monte Carlo estimates of the prior densities p(a_ij | z_ij) for different values of h and v_0.

The lack of an explicit form of p(a_ij | z_ij) makes the strategy of analytically calculating v_0 from the threshold δ infeasible. Numerical methods that estimate p(a_ij | z_ij) from Markov chain Monte Carlo (MCMC) samples might be used to choose v_0 according to δ from a range of possible values.

Another perspective is that, when v_0 is chosen to be very small and h is chosen to be very large, the prior for a_ij is close to a point-mass mixture that selects any a_ij ≠ 0 as an edge. Because the point-mass mixture prior provides a noninformative method of structure learning when the threshold δ cannot be meaningfully specified, it makes sense to choose v_0 to be as small as possible, but not so small that it could cause MCMC convergence issues, and to choose v_1 to allow for reasonable values of a_ij. In our experiments with standardized data, MCMC converges quickly and mixes quite well as long as v_0 ≥ 0.01 and h is not too large.

The value of λ controls the distribution of the diagonal elements a_ii. Because the data are usually standardized, the choice λ = 1 assigns substantial probability to the entire region of plausible values of a_ii, without overconcentration or overdispersion. From our experience, the data generally contain sufficient information about the diagonal elements, and the structure learning results are insensitive to a range of λ values, such as λ = 5 and 10.

The primary advantage of the SSSL prior (8)–(9) over traditional approaches is its scalability to larger p problems. The reduction in computing time comes from two improvements. One is that (8)–(9) enable block updates of all p(p − 1)/2 edge-inclusion indicators in Z simultaneously, while other methods only update one edge-inclusion indicator z_ij at a time. The other is that there is no need for a Monte Carlo approximation of the intractable constants, while all of the other methods require some sort of Monte Carlo integration to evaluate any graph. The general sampling scheme for generating posterior samples of graphs is to sample from the joint distribution p(A, Z | Y) by iteratively generating from p(A | Z, Y) and p(Z | A, Y). The first conditional p(A | Z, Y) is sampled in a column-wise manner and the second conditional p(Z | A, Y) is generated all at once; a sketch of this two-block skeleton is given below. The details depend on whether A = Ω for concentration graph models or A = Σ for covariance graph models, and they are described next.
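The following Python sketch (ours, not the paper's MATLAB implementation) outlines this two-block scheme for the concentration graph case. The column update implements the conditionals of Proposition 1 below as reconstructed here, so it should be read as an illustration under those assumptions rather than as the author's code; the Z update draws all indicators at once because the C(Z, v_0, v_1, λ) terms cancel in the joint distribution:

```python
import numpy as np
from scipy.stats import norm

def update_Z(A, v0, v1, pi, rng):
    """Draw all edge-inclusion indicators in one block: z_ij | a_ij ~ Bernoulli."""
    p = A.shape[0]
    iu = np.triu_indices(p, k=1)                 # off-diagonal (i < j) entries
    a = A[iu]
    w1 = pi * norm.pdf(a, scale=np.sqrt(v1))
    w0 = (1 - pi) * norm.pdf(a, scale=np.sqrt(v0))
    Z = np.zeros((p, p), dtype=bool)
    Z[iu] = rng.random(a.size) < w1 / (w0 + w1)
    return Z | Z.T

def update_column(Omega, S, Z, n, v0, v1, lam, j, rng):
    """One column/row update of Omega, per the Proposition-1-type conditionals
    stated below (our reconstruction)."""
    p = Omega.shape[0]
    idx = np.arange(p) != j
    O11_inv = np.linalg.inv(Omega[np.ix_(idx, idx)])
    s12, s22 = S[idx, j], S[j, j]
    v = np.where(Z[idx, j], v1, v0)              # spike or slab variance per entry
    C = np.linalg.inv((s22 + lam) * O11_inv + np.diag(1.0 / v))
    u = rng.multivariate_normal(-C @ s12, C)     # off-diagonal block
    gamma = rng.gamma(n / 2 + 1, 2.0 / (s22 + lam))   # rate (s22 + lam)/2
    Omega[idx, j] = Omega[j, idx] = u
    Omega[j, j] = gamma + u @ O11_inv @ u
    return Omega

def sssl_gibbs(S, n, v0, v1, pi, lam, n_iter, rng):
    """Skeleton of the SSSL block Gibbs sampler for concentration graphs."""
    p = S.shape[0]
    Omega = np.eye(p)
    draws = []
    for _ in range(n_iter):
        Z = update_Z(Omega, v0, v1, pi, rng)     # all edges in one block
        for j in range(p):                       # column-wise update of Omega
            Omega = update_column(Omega, S, Z, n, v0, v1, lam, j, rng)
        draws.append(Z)
    return draws
```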
Consider concentration graph models with A = Ω in the hierarchical prior (8)–(9). To sample from p(Ω | Z, Y), the following proposition provides the necessary conditional distributions. The proof is in the Appendix.

Proposition 1. Suppose A = Ω in the hierarchical prior (8)–(9). Focus on the last column and row. Let V = (v_{z_ij}) be a p × p symmetric matrix with zeros in the diagonal entries and v_{z_ij} in the off-diagonal entries, and partition Ω, S, and V as

    Ω = [ Ω_11  ω_12 ; ω_21  ω_22 ],  S = [ S_11  s_12 ; s_21  s_22 ],  V = [ V_11  v_12 ; v_21  v_22 ],

where the (1,1) blocks are (p − 1) × (p − 1). Under the change of variables (ω_12, ω_22) → (u, γ) with u = ω_12 and γ = ω_22 − ω_21 Ω_11^{-1} ω_12, the conditional posterior distribution of the last column is

    u | − ~ N(−C s_12, C),  C = {(s_22 + λ) Ω_11^{-1} + diag(v_12)^{-1}}^{-1},
    γ | − ~ Gamma(n/2 + 1, (s_22 + λ)/2).

Now, consider covariance graph models with A = Σ in the hierarchical prior (8)–(9). To sample from p(Σ | Z, Y), the following proposition provides the necessary conditional distributions; its proof is in the Appendix.

Proposition 2. Suppose A = Σ in the hierarchical prior (8)–(9). Focus on the last column and row. Let V = (v_{z_ij}) be a p × p symmetric matrix with zeros in the diagonal entries and v_{z_ij} in the off-diagonal entries, and partition Σ, S, and V as above. Under the change of variables (σ_12, σ_22) → (u, γ) with u = σ_12 and γ = σ_22 − σ_21 Σ_11^{-1} σ_12, the conditional posterior distribution of the last column is

    u | γ, − ~ N(C Σ_11^{-1} s_12 / γ, C),  C = {Σ_11^{-1} S_11 Σ_11^{-1}/γ + λ Σ_11^{-1} + diag(v_12)^{-1}}^{-1},
    γ | u, − ~ GIG(1 − n/2, λ, s_22 − 2 s_21 Σ_11^{-1} u + u′ Σ_11^{-1} S_11 Σ_11^{-1} u),

where GIG(q, a, b) denotes the generalized inverse Gaussian distribution with density proportional to x^{q−1} exp{−(a x + b/x)/2}. In both cases, the conditional p(Z | A, Y) factors into independent Bernoulli distributions for the z_ij, as the normalizing constants C(Z, v_0, v_1, λ) cancel in the joint distribution, so all indicators are updated in one block.

To assess computing time, we run the samplers for a range of dimensions p, with the prior hyperparameters set at π = 2/(p − 1) and λ = 1. All chains are initialized at the sample covariance matrix. All computations are implemented on a six-core 3.33 GHz CPU desktop using MATLAB. For each run, we measure the time it takes the block Gibbs sampler to sweep across all columns (rows) once, which is called one iteration. One iteration actually updates each element a_ij twice: once when updating column i and again when updating column j; this property improves efficiency. Figure 3 plots the time per 1000 iterations against p for covariance graph models and concentration graph models, respectively.

Overall, the SSSL algorithms run fast. Covariance graph models take approximately 2 and 9 minutes to generate 1000 iterations for p = 100 and 200; concentration graph models take even less time, approximately 1.2 and 5 minutes. The relatively slower speed of covariance graph models is due to a few more matrix inversion steps in updating the columns of Σ. We also measure the mixing of the MCMC output by calculating the inefficiency factor 1 + 2 Σ_{k=1}^{∞} ρ(k), where ρ(k) is the sample autocorrelation at lag k. We use 5000 samples after 2000 burn-ins and K lags in the estimation of the inefficiency factors, where K is the first lag k ≥ 1 at which ρ(k) falls below 2/√M, with M = 5000 being the total number of saved iterations. The median inefficiency factor among all of the elements of Ω was 1 when p = 100, further suggesting the efficiency of SSSL. In our experience, an MCMC sample of 5000 iterations after 2000 burn-ins usually generates reliable results in terms of Monte Carlo errors for p = 100 node problems, meaning that the computing time is usually less than 10 minutes, far less than the few days of computing time required by the existing methods.

Figure 3: Time in minutes for 1000 iterations of SSSL versus dimension p for covariance graph models (solid) and concentration graph models (dashed).

The preceding section shows tremendous computational gains of SSSL over existing methods. This section evaluates these methods on their structure learning performance using two synthetic scenarios, both of which are motivated by real-world applications.

Scenario 1. The first scenario mimics the dependence pattern of daily currency exchange rate returns, which has been previously analyzed via concentration graph models (Carvalho and West, 2007; Wang et al., 2011). We use two different synthetic datasets for the two types of graphical models. For concentration graph models, Jones et al. (2005) generated a simulated dataset of p = 12 and n = 250 that mimics the exchange rate return pattern. We use their original dataset downloaded from their website. The true underlying concentration graph has 13 edges and is given by Figure 2 of Jones et al. (2005). For covariance graph models, we estimate a sparse covariance matrix with 13 edges based on the data of Jones et al. (2005) and then use this sparse covariance matrix to generate n = 250 samples.
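A minimal sketch of this data-generation step, with a toy positive definite sparse matrix standing in for the paper's estimated 13-edge covariance matrix (which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 250, 12
# Toy tridiagonal sparse covariance (positive definite); not the original matrix.
Sigma_true = np.eye(p)
for i in range(p - 1):
    Sigma_true[i, i + 1] = Sigma_true[i + 1, i] = 0.4
Y = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)
S = Y.T @ Y            # sufficient statistic used by the samplers
```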
To assess the performance of structure learning, we compute the number of true positives (TP) and the number of false positives (FP). We begin by evaluating some benchmark models against which to compare SSSL. For concentration graph models, we consider the classical adaptive graphical lasso (Fan et al., 2009) and the Bayesian G-Wishart prior W_G(3, I_p) in (1). For covariance graph models, we consider the classical adaptive thresholding (Cai and Liu, 2011) and the Bayesian G-inverse Wishart prior IW_G(3, I_p) in (4). The two classical methods require some tuning parameters, which are chosen by 10-fold cross-validation. The adaptive graphical lasso has TP = 8 and FP = 7 for concentration graph models, and the adaptive thresholding has TP = 13 and FP = 45 for covariance graph models. They seem to favor less sparse models, most likely because the sample size is large relative to the dimensions. The two Bayesian methods are implemented under the prior over graphs (2) with π = 2/(p − 1). The G-Wishart has TP = 9 and FP = 1 for concentration graph models, and the G-inverse Wishart has TP = 7 and FP = 0 for covariance graph models.

We investigate the performance of SSSL by considering a range of hyperparameter values that represent different prior beliefs about graphs: π ∈ {2/(p − 1), 4/(p − 1), 0.5}, h ∈ {10, 50, 100}, and v_0 ∈ {0.02, 0.05, 0.1}. Under each hyperparameter setting, a sample of 10000 iterations is used to estimate the posterior median graph, which is then compared against the true underlying graph. Tables 1 and 2 summarize TP and FP for concentration and covariance graphs, respectively. Within each table, patterns can be observed by comparing results across different values of one hyperparameter while fixing the others. For v_0, a larger value lowers both TP and FP, because v_0 is positively related to the threshold of practical significance below which a_ij can be treated as zero. For π, a larger value increases both TP and FP, with π = 0.5 favoring the densest graphs. For h, increasing h reduces both TP and FP, partly because h is positively related to the threshold below which a_ij can be practically treated as zero, and partly because h is negatively related to the implied prior edge-inclusion probability Pr(z_ij = 1), as discussed in Panel (a) of Figure 1. Next, comparing the results in Tables 1–2 with the two Bayesian benchmarks reported above, we can see that SSSL is competitive: several settings match or improve upon the G-Wishart method for concentration graph models and the G-inverse Wishart method for covariance graph models.

Table 1: Summary of performance measures under different hyperparameters for the p = 12 concentration graph example. As benchmarks, the classical adaptive graphical lasso has TP = 8 and FP = 7; the Bayesian G-Wishart has TP = 9 and FP = 1. (Results for v_0 = 0.1 not shown.)

            π = 2/(p−1)            π = 4/(p−1)            π = 0.5
  v_0     h=10  h=50  h=100      h=10  h=50  h=100      h=10  h=50  h=100
  Number of true positives (TP)
  0.02      9     9     9         10    10    10         10    10    10
  0.05     10     9     9         10    10     9         10    10    10
  Number of false positives (FP)
  0.02      3     2     0          7     3     1         14     4     3
  0.05      2     0     0          4     1     0          8     2     0

Table 2: Summary of performance measures under different hyperparameters for the p = 12 covariance graph example. As benchmarks, the classical adaptive thresholding has TP = 13 and FP = 45; the Bayesian G-inverse Wishart has TP = 7 and FP = 0. (Results for v_0 = 0.1 not shown.)

            π = 2/(p−1)            π = 4/(p−1)            π = 0.5
  v_0     h=10  h=50  h=100      h=10  h=50  h=100      h=10  h=50  h=100
  Number of true positives (TP)
  0.02      7     7     7          8     7     7          9     7     7
  0.05      7     7     6          7     7     7          7     7     7
  Number of false positives (FP)
  0.02      0     0     0          1     0     0          5     0     0
  0.05      0     0     0          0     0     0          0     0     0
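The TP and FP counts compare the estimated median probability graph with the true adjacency matrix; a minimal sketch (our naming), where prob is a hypothetical matrix of estimated posterior probabilities Pr(z_ij = 1 | Y) computed from the MCMC output:

```python
import numpy as np

def tp_fp(prob, Z_true, cut=0.5):
    """TP/FP of the median probability graph against the true adjacency."""
    p = prob.shape[0]
    iu = np.triu_indices(p, k=1)
    est = prob[iu] > cut                 # edges in the posterior median graph
    truth = Z_true[iu].astype(bool)
    tp = int(np.sum(est & truth))        # true edges recovered
    fp = int(np.sum(est & ~truth))       # spurious edges
    return tp, fp
```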
Scenario 2. The second scenario mimics the dependence pattern of gene expression data, for which graphical models are used extensively to understand the underlying biological relationships. The real data are the breast cancer data (Jones et al., 2005; Castelo and Roverato, 2006), which contain p = 150 genes related to the estrogen receptor pathway. Similar to the first scenario, we generate two synthetic datasets for the two graphical models. For concentration graph models, we first estimate a sparse concentration matrix with 179 edges based on the real data, and then generate a sample of 200 observations from this estimated sparse concentration matrix. For covariance graph models, we estimate a sparse covariance matrix with 101 edges based on the real data and then use this sparse covariance matrix to generate a synthetic dataset of n = 200 samples. Panels (a) and (b) of Figure 4 display the frequencies of the non-zero partial correlations and correlations implied by these two underlying sparse matrices, respectively. Among the nonzero elements, 13% of the partial correlations and 60% of the correlations are small in magnitude, so many signals are weak.

Figure 4: Histograms showing the empirical frequency of the non-zero elements of the partial correlation matrix for the first dataset (left) and of the non-zero elements of the correlation matrix for the second dataset (right) in Scenario 2 with p = 150.

The classical adaptive graphical lasso has TP = 145 and FP = 929 for concentration graph models, and the adaptive thresholding has TP = 14 and FP = 12 for covariance graph models. The G-Wishart prior has TP = 105 and FP = 2 and takes about four days to run. The evaluation of the G-inverse Wishart is worth elaborating, since existing experiments are conducted in smaller p settings and little is known about its performance in high-dimensional problems.

The original model-fitting algorithm for the G-inverse Wishart relies on the computationally expensive importance sampling for approximating the normalizing constant and thus is slow and numerically unstable for this p = 150 dataset. Silva (2013) develops a new approximation that requires no Monte Carlo integration, which greatly speeds up the computation. The MATLAB routines implementing his algorithm are publicly available on that paper's website. We adopt these functions with one modification that sets the edge-inclusion probability to be 2/(p − 1), matching the prior used elsewhere in this paper. We also note that SSSL's block update of Z is faster than Silva (2013)'s one-edge-at-a-time update.

Using the posterior median graph as an estimate of G, the G-inverse Wishart prior IW_G(3, I_p) produces TP = 3 and FP = 3. These two numbers are surprisingly small. An exploration of the G-inverse Wishart distribution provides some explanations. The main reason is that, when G is sparse and p is large, the G-inverse Wishart prior inadvertently forces the free elements in Σ towards zero and hence exerts a strong prior influence on the posterior distribution. To see this, suppose G is empty; then (4) implies that the diagonal elements {σ_ii} follow independent inverse gamma distributions

    p(σ_ii | b) ∝ σ_ii^{−(b+2p)/2} exp{−d_ii/(2σ_ii)},  i = 1, ..., p,

whose means decrease with p and converge to zero rapidly as p increases for a fixed b. Now suppose G is arbitrarily sparse. Although the theoretical marginal distributions of the free elements in Σ are unknown, the distribution of {σ_ii} under an empty graph leads us to conjecture that the free elements in Σ could be extremely concentrated around zero for large p as well.
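To quantify this concentration under the empty-graph density displayed above (our reconstruction of the marginal), the implied inverse gamma mean is

    E(σ_ii | b) = (d_ii/2) / {(b + 2p)/2 − 2} = d_ii / (b + 2p − 4),

so with b = 3, D = I_p, and p = 150, E(σ_ii) = 1/299 ≈ 0.0033: the prior pins the diagonal near zero before any data are seen.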
In our p = 150 example, a simulation from IW_G(3, I_p) under the ground truth G that contains 101 edges supports this conjecture. The estimated prior means of these 101 off-diagonal free elements {σ_ij} are all within a tiny interval around zero, and the estimated prior standard deviations are similarly small. Such a tightly concentrated prior provides little probability support for the true graph. The implication for structure learning is that the Bayes factor might not truly respond to the data, but largely reflect the prior prejudice that the σ_ij are extremely small. In other words, the overly concentrated prior does not allow the data to speak about G, and consequently the posterior distribution of G is dominated by its prior. In fact, the posterior sample mean and standard deviation of the number of edges in Z computed from the MCMC output of Silva (2013) are 170 and 12.9, which are close to the prior expected number of edges, computed as {p(p − 1)/2} × {2/(p − 1)} = 150, and its standard deviation, computed as [{p(p − 1)/2} × {2/(p − 1)} × {1 − 2/(p − 1)}]^{1/2} ≈ 12.2.

These results suggest that the parameter space {b : b > 0} assumed by Silva and Ghahramani (2009) is too restrictive. It might be reasonable to let the parameter space depend on G. For example, standard inverse-Wishart theory implies that the parameter space should be {b : b > 2 − 2p} when G is empty, so b could even be negative, and {b : b > 0} when G is full. Hierarchical models that allow the value of b to be G-dependent might be helpful. A thorough examination along these lines is beyond the scope of the current paper. However, it is probably safe to conclude that further investigation should be called upon to follow the innovative framework of Silva and Ghahramani (2009).

As for SSSL, Tables 3 and 4 summarize its performance. When the results are compared across different levels of one hyperparameter, the general relations between TP or FP and a hyperparameter are similar to those in Scenario 1. In fact, the patterns appear to be more pronounced in Scenario 2, because priors have greater influence in this relatively small sample size problem. When compared with the benchmarks, SSSL is competitive. For concentration graph models, Table 3 suggests that, except for a few extreme priors that favor overly dense graphs, SSSL produces much sparser graphs than the classical adaptive graphical lasso, for which FP = 929 is too high. When v_0 = 0.02, SSSL performs comparably to the G-Wishart prior. When v_0 increases, TP drops quickly, because many signals are weak (Figure 4) and are thus treated as practically insignificant by SSSL. For covariance graph models, Table 4 suggests that SSSL generally performs better than the adaptive thresholding; for example, under v_0 = 0.02, h = 50, and π = 2/(p − 1), SSSL has TP = 19 and FP = 1, against TP = 14 and FP = 12 for the adaptive thresholding. The only exceptions are cases in which the hyperparameters favor overly dense graphs (e.g., π = 0.5) or overly sparse graphs (e.g., v_0 = 0.1).

Table 3: Summary of performance measures under different hyperparameters for the p = 150 concentration graph example. As benchmarks, the classical adaptive graphical lasso has TP = 145 and FP = 929; the Bayesian G-Wishart has TP = 105 and FP = 2. (Results for v_0 = 0.1 not shown.)

            π = 2/(p−1)            π = 4/(p−1)            π = 0.5
  v_0     h=10  h=50  h=100      h=10  h=50  h=100      h=10  h=50  h=100
  Number of true positives (TP)
  0.02    106   101   100        110   105   102        162   136   125
  0.05     90    82    78         97    89    83        146   117   109
  Number of false positives (FP)
  0.02      3     0     0          9     1     0       1533   238    91
  0.05      0     0     0          0     0     0        667    39     9
Table 4: Summary of performance measures under different hyperparameters for the p = 150 covariance graph example. As benchmarks, the classical adaptive thresholding has TP = 14 and FP = 12; the Bayesian G-inverse Wishart cannot run at this scale. (Results for v_0 = 0.1 not shown.)

            π = 2/(p−1)            π = 4/(p−1)            π = 0.5
  v_0     h=10  h=50  h=100      h=10  h=50  h=100      h=10  h=50  h=100
  Number of true positives (TP)
  0.02     20    19    15         20    20    17         38    27    24
  0.05     12    10    10         14    11    10         27    15    15
  Number of false positives (FP)
  0.02      4     1     1          6     3     1       1084   163    69
  0.05      0     0     0          0     0     0        229    17     3

Overall, no method uniformly dominates the others in structure learning accuracy. However, SSSL's computational advantage of sheer speed and simplicity makes it very attractive for routine use.

This section illustrates the practical utility of graphical models by applying them to credit default swap (CDS) data. A CDS is a credit protection contract in which the buyer of the protection periodically pays a small amount of money, known as the "spread", to the seller of the protection in exchange for the seller's payoff to the buyer if the reference entity defaults on its obligation. The spread depends on the creditworthiness of the reference entity and thus can be used to monitor how the market views the credit risk of the reference entity. The aim of this statistical analysis is to understand the cross-sectional dependence structure of CDS series and thus the joint credit risks of the entities. The interconnectedness of credit risks is important, as it is an example of systemic risk, the risk that many financial institutions fail together.

The data are provided by Markit via Wharton Research Data Services (WRDS) and comprise daily closing CDS spread quotes for 79 widely traded North American reference entities from January 2, 2001 through April 23, 2012. The quotes are for five-year maturity, as five-year maturity is the most widely traded and studied term. The spread quotes are then transformed into log returns for graphical model analysis.

For each month t, we use the daily CDS returns over the period of month t − 11 to month t to estimate the graph G_t. The choice of a one-year window is intended to balance the number of observations as well as to accommodate the time-varying nature of the graphs. The first estimation period begins in January 2001 and continues through December 2001, and the last is from May 2011 to April 2012. In total, there are 89 estimation periods, corresponding to 89 time-varying graphs for each type of graph. We set the prior hyperparameters at h = 50, π = 2/(p − 1), and λ = 1, with a small v_0. The MCMC is run for 10000 iterations and the first 3000 iterations are discarded. The graphs are estimated by the posterior median graph.
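A minimal sketch of this rolling-window design, assuming spreads is a pandas DataFrame of daily CDS quotes indexed by date and fit_median_graph is a hypothetical wrapper around the SSSL sampler returning the posterior median graph:

```python
import numpy as np
import pandas as pd

# spreads: daily quotes, one column per reference entity (assumed available)
returns = np.log(spreads).diff().dropna()            # daily log returns
months = returns.index.to_period("M").unique()
graphs = {}
for t in range(11, len(months)):                     # 12-month moving windows
    in_window = ((returns.index.to_period("M") >= months[t - 11])
                 & (returns.index.to_period("M") <= months[t]))
    window = returns[in_window]
    # fit_median_graph: hypothetical SSSL fit + 0.5 cut on Pr(z_ij = 1 | Y)
    graphs[str(months[t])] = fit_median_graph(window.to_numpy())
```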
Figure 5: Time series plots of the estimated numbers of edges for the covariance graph (solid line), the concentration graph (dashed), and the common graph (dash-dotted).

Figure 5 shows changes over time of the estimated number of edges for the two types of graphs and the number of common edges. The numbers of edges are in the range of 40 to 160 out of a total of 3081 possible edges, indicating a very high level of sparsity. At each time point, both types of graph reflect about the same level of sparsity. Over time, they show similar patterns of temporal variation. There is a steady upward trend in the number of edges starting from mid-2007 for both types of graph. The timing of this rise is suggestive. Mid-2007 saw the start of the subprime mortgage crisis, which was signified by a number of credit events, including bankruptcies and significant losses at several banks and mortgage firms, such as Bear Stearns, GM finance, UBS, HSBC, and Countrywide. If the number of edges can be viewed as the "degree of connectedness" among CDS series, then this observed increase implies that the market tends to be more integrated during periods of credit crisis and consequently tends to have a higher systemic risk.

To further illustrate the interpretability of graphs, Figure 6 provides four snapshots of graph details at two time points, December 2004 and December 2008. The covariance graph for December 2004 (Panel a) has 44 edges and 30 completely isolated nodes, as opposed to the covariance graph for December 2008 (Panel b), which has 128 edges and no isolated nodes. These differences suggest that the connectedness of the network rises as the credit risks of many reference entities become linked with each other. The same message is further confirmed by the concentration graphs. The concentration graph for December 2004 (Panel c) has 55 edges and 20 completely isolated nodes, while the concentration graph for December 2008 (Panel d) has 135 edges and no isolated nodes. The increase in connectedness is also manifested by the fact that every pair of nodes in the concentration graph for December 2008 is connected by a path.

Figure 6: Estimated graphs of 79 CDS returns at two different time points. The four panels are: Panel (a), covariance graph in December 2004; Panel (b), covariance graph in December 2008; Panel (c), concentration graph in December 2004; and Panel (d), concentration graph in December 2008.

Finally, we zoom into subgraphs involving the American International Group (AIG).

This paper proposes a new framework for the structure learning of concentration graph models and covariance graph models. The main goal is to overcome the scalability limitation of existing methods without sacrificing structure learning accuracy. The key idea is to use absolutely continuous spike and slab priors instead of the popular point-mass mixture priors to enable accurate, fast, and simple structure learning. Our analysis suggests that the accuracy of these new methods is comparable to that of existing Bayesian methods, but the model fitting is vastly faster and simpler to implement. Problems with 100–250 nodes can now be fitted in a few minutes, as opposed to a few days or even numeric infeasibility under existing methods. This remarkable efficiency will facilitate the application of Bayesian graphical models to large problems and will provide scalable and effective modular structures for more complicated models.

The focus of the paper is on the structure learning of graphs. A related yet different question is the parameter estimation of the covariance matrix. Since our priors place zero probability mass on any sparse matrix containing exact zeros, in contrast to the point-mass mixture priors, one concern is where and how the posterior of Ω or Σ will concentrate when the true covariance/concentration matrix is sparse. Our limited experiments suggest that the posterior distribution from the proposed two-component normals is indeed more dispersed around zero than those from the point-mass mixture priors. Take the concentration graph in Scenario 1 as an example. Under SSSL, the average magnitude of the posterior mean estimates of the {ω_ij} corresponding to the true zeros in Ω is in the range of 0.0085 to 0.045, depending on the hyperparameters (v_0, h, π).
In contrast, under the G-Wishart prior, the estimates of these {ω_ij} have a smaller average magnitude of 0.0040. The small-variance normal shrinks parameters less aggressively than the point-mass mixture. Although we do not find this problematic for structure learning, such weaker shrinkage may cause SSSL's parameter estimation performance to be suboptimal. A refinement is to replace the normal distribution with distributions having higher densities near zero. The Bayesian shrinkage regression literature shows that some heavy-tailed distributions offer parameter estimation performance comparable to the point-mass mixture priors (e.g., Armagan et al. 2013; Griffin and Brown 2010; Carvalho et al. 2010). Computational tractability of SSSL is maintained by applying data augmentation to the mixture normal representation of these alternative distributions.

Appendix

Proof of Proposition 1. The conditional distribution of Ω given the edge-inclusion indicators Z is

    p(Ω | Y, Z) ∝ |Ω|^{n/2} exp{−(1/2) tr(SΩ)} ∏_{i<j} N(ω_ij | 0, v_{z_ij}) ∏_i Exp(ω_ii | λ/2) 1{Ω ∈ M⁺}.

Partitioning as in the proposition and completing the square in (u, γ) yields the stated normal and gamma conditionals.

Proof of Proposition 2. Given the edge-inclusion indicators Z, the conditional posterior distribution of Σ is

    p(Σ | Y, Z) ∝ |Σ|^{−n/2} exp{−(1/2) tr(SΣ^{-1})} ∏_{i<j} N(σ_ij | 0, v_{z_ij}) ∏_i Exp(σ_ii | λ/2) 1{Σ ∈ M⁺}.

The same partitioning, combined with the block inverse of Σ, yields the stated normal and generalized inverse Gaussian conditionals.

The MATLAB routines implementing all frequentist and Bayesian procedures used in the paper are available from the author's web site.

References

Armagan, A., Dunson, D. B., and Lee, J. (2013). "Generalized double Pareto shrinkage." Statistica Sinica, 23(1): 119.

Atay-Kayis, A. and Massam, H. (2005). "The marginal likelihood for decomposable and non-decomposable graphical Gaussian models." Biometrika, 92: 317–335.

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). "Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data." The Journal of Machine Learning Research, 9: 485–516.

Bhattacharya, A. and Dunson, D. B. (2011). "Sparse Bayesian infinite factor models." Biometrika, 98(2): 291–306.

Bickel, P. J. and Levina, E. (2008). "Covariance Regularization by Thresholding." The Annals of Statistics, 36(6): 2577–2604.

Bien, J. and Tibshirani, R. J. (2011). "Sparse estimation of a covariance matrix." Biometrika, 98(4): 807–820.

Cai, T. and Liu, W. (2011). "Adaptive Thresholding for Sparse Covariance Matrix Estimation." Journal of the American Statistical Association, 106(494): 672–684. URL http://amstat.tandfonline.com/doi/abs/10.1198/jasa.2011.tm10560

Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008). "High-dimensional sparse factor modeling: Applications in gene expression genomics." Journal of the American Statistical Association, 103(484): 1438–1456.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). "The horseshoe estimator for sparse signals." Biometrika, 97(2): 465–480.

Carvalho, C. M. and West, M. (2007). "Dynamic matrix-variate graphical models." Bayesian Analysis, 2: 69–98.

Castelo, R. and Roverato, A. (2006). "A Robust Procedure For Gaussian Graphical Model Search From Microarray Data With p Larger Than n." Journal of Machine Learning Research, 7: 2621–2650.

Chaudhuri, S., Drton, M., and Richardson, T. S. (2007). "Estimation of a covariance matrix with zeros." Biometrika, 94(1): 199–216.

Cheng, Y. and Lenkoski, A. (2012). "Hierarchical Gaussian graphical models: Beyond reversible jump." Electronic Journal of Statistics, 6: 2309–2331.

Cox, D. R. and Wermuth, N. (1993). "Linear Dependencies Represented by Chain Graphs." Statistical Science, 8(3): 204–218.

Dawid, A. P. and Lauritzen, S. L. (1993). "Hyper-Markov laws in the statistical analysis of decomposable graphical models." Annals of Statistics, 21: 1272–1317.

Dempster, A. (1972). "Covariance selection." Biometrics, 28: 157–175.
Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). "Bayesian inference for general Gaussian graphical models with application to multivariate lattice data." Journal of the American Statistical Association, 106(496): 1418–1433.

Fan, J., Feng, Y., and Wu, Y. (2009). "Network exploration via the adaptive LASSO and SCAD penalties." Annals of Applied Statistics, 3(2): 521–541.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). "Sparse inverse covariance estimation with the graphical lasso." Biostatistics, 9(3): 432–441.

George, E. I. and McCulloch, R. E. (1993). "Variable selection via Gibbs sampling." Journal of the American Statistical Association, 88: 881–889.

Giudici, P. and Green, P. J. (1999). "Decomposable graphical Gaussian model determination." Biometrika, 86: 785–801.

Griffin, J. E. and Brown, P. J. (2010). "Inference with normal-gamma prior distributions in regression problems." Bayesian Analysis, 5(1): 171–188.

Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). "Experiments in stochastic computation for high-dimensional graphical models." Statistical Science, 20: 388–400.

Kauermann, G. (1996). "On a Dualization of Graphical Gaussian Models." Scandinavian Journal of Statistics, 23(1): 105–116.

Khare, K. and Rajaratnam, B. (2011). "Wishart distributions for decomposable covariance graph models." Annals of Statistics, 39(1): 514–555.

Khondker, Z. S., Zhu, H., Chu, H., Lin, W., and Ibrahim, J. G. (2013). "The Bayesian covariance lasso." Statistics and its Interface, 6(2): 243.

Lenkoski, A. and Dobra, A. (2011). "Computational Aspects Related to Inference in Gaussian Graphical Models With the G-Wishart Prior." Journal of Computational and Graphical Statistics, 20(1): 140–157.

Murray, I. (2007). Advances in Markov chain Monte Carlo methods. Ph.D. thesis, University College London.

Park, T. and Casella, G. (2008). "The Bayesian Lasso." Journal of the American Statistical Association, 103(482): 681–686.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). "Sparse permutation invariant covariance estimation." Electronic Journal of Statistics, 2: 494–515.

Rothman, A. J., Levina, E., and Zhu, J. (2009). "Generalized Thresholding of Large Covariance Matrices." Journal of the American Statistical Association, 104(485): 177–186.

Roverato, A. (2002). "Hyper-Inverse Wishart Distribution for Non-decomposable Graphs and its Application to Bayesian Inference for Gaussian Graphical Models." Scandinavian Journal of Statistics, 29: 391–411.

Scott, J. G. and Carvalho, C. M. (2008). "Feature-Inclusion Stochastic Search for Gaussian Graphical Models." Journal of Computational and Graphical Statistics, 17(4): 790–808.

Silva, R. (2013). "A MCMC Approach for Learning the Structure of Gaussian Acyclic Directed Mixed Graphs." In Statistical Models for Data Analysis, 343–351. Springer.

Silva, R. and Ghahramani, Z. (2009). "The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models." Journal of Machine Learning Research, 10: 1187–1238.

Wang, H. (2010). "Sparse seemingly unrelated regression modelling: Applications in finance and econometrics." Computational Statistics & Data Analysis, 54(11): 2866–2877.

— (2012). "The Bayesian Graphical Lasso and Efficient Posterior Computation." Bayesian Analysis, 7(2): 771–790.
— (2014). "Coordinate descent algorithm for covariance graphical lasso." Statistics and Computing, 24(4): 521–529.

Wang, H. and Li, S. Z. (2012). "Efficient Gaussian graphical model determination under G-Wishart prior distributions." Electronic Journal of Statistics, 6: 168–198.

Wang, H., Reeson, C., and Carvalho, C. M. (2011). "Dynamic Financial Index Models: Modeling Conditional Dependencies via Graphs." Bayesian Analysis, 6(4): 639–664.

Wermuth, N., Cox, D. R., and Marchetti, G. M. (2006). "Covariance Chains." Bernoulli, 12: 841–862.

Yuan, M. and Lin, Y. (2007). "Model selection and estimation in the Gaussian graphical model." Biometrika, 94(1): 19–35.