High-dimensional learning of linear causal networks via inverse covariance estimation
Po-Ling Loh [email protected]
Department of Statistics, University of California, Berkeley, CA 94720, USA
Peter Bühlmann [email protected]
Seminar für Statistik, ETH Zürich, Switzerland
Abstract
We establish a new framework for statistical estimation of directed acyclic graphs (DAGs) when data are generated from a linear, possibly non-Gaussian structural equation model. Our framework consists of two parts: (1) inferring the moralized graph from the support of the inverse covariance matrix; and (2) selecting the best-scoring graph among DAGs that are consistent with the moralized graph. We show that when the error variances are known or estimated to close enough precision, the true DAG is the unique minimizer of the score computed using the reweighted squared ℓ_2-loss. Our population-level results have implications for the identifiability of linear SEMs when the error covariances are specified up to a constant multiple. On the statistical side, we establish rigorous conditions for high-dimensional consistency of our two-part algorithm, defined in terms of a "gap" between the true DAG and the next best candidate. Finally, we demonstrate that dynamic programming may be used to select the optimal DAG in linear time when the treewidth of the moralized graph is bounded.
Keywords: causal inference, dynamic programming, identifiability, inverse covariance matrix estimation, linear structural equation models
1. Introduction
Causal networks arise naturally in a wide variety of application domains, including genetics, epidemiology, and time series analysis (Hughes et al., 2000; Stekhoven et al., 2012; Aalen et al., 2012). The task of inferring the graph structure of a causal network from joint observations is a relevant but challenging problem. Whereas undirected graphical structures may be estimated via pairwise conditional independence testing, with worst-case time scaling as the square of the number of nodes, estimation methods for directed acyclic graphs (DAGs) first require learning an appropriate permutation order of the vertices, leading to computational complexity that scales exponentially in the graph size. Greedy algorithms present an attractive, computationally efficient alternative, but such methods are not generally guaranteed to produce the correct graph (Chickering, 2002). In contrast, exact methods for causal inference that search exhaustively over the entire DAG space may only be tractable for relatively small graphs (Silander and Myllymaki, 2006).

In practice, prior knowledge about the structure of the underlying DAG may lead to vast computational savings. For example, if a natural ordering of the nodes is known, inference may be performed by regressing each node upon its predecessors and selecting the best functional fit for each node. This yields an algorithm with runtime linear in the number of nodes and overall quadratic complexity. In the linear high-dimensional Gaussian setting, one could apply a version of the graphical Lasso, where the feasible set is restricted to matrices that are upper-triangular with respect to the known ordering (Shojaie and Michailidis, 2010). However, knowing the node order seems quite unrealistic in practice, except for certain special cases. If instead a conditional independence graph or a superset of the skeleton is specified a priori, the number of required conditional independence tests may also be reduced dramatically. This appears to be a more reasonable assumption, and various authors have devised algorithms to compute the optimal DAG efficiently in settings where the input graph has bounded degree and/or bounded treewidth (Perrier et al., 2008; Ordyniak and Szeider, 2012; Korhonen and Parviainen, 2013).

Unfortunately, appropriate tools for inferring such superstructures are rather limited, and the usual method of using the graphical Lasso to estimate a conditional independence graph is rigorously justified only in the linear Gaussian setting (Yuan and Lin, 2007). Recent results have established that a version of the graphical Lasso may also be used to learn a conditional independence graph for variables taking values in a discrete alphabet when the graph has bounded treewidth (Loh and Wainwright, 2013), but results for more general distributions are absent from the literature. Bühlmann et al. (2013) isolate sufficient conditions under which Lasso-based linear regression could be used to recover a conditional independence graph for general distributions, and use it as a prescreening step for nonparametric causal inference in additive noise models; however, it is unclear which non-Gaussian distributions satisfy such conditions.
We propose a new algorithmic strategy for inferring the DAG structure of a linear, potentially non-Gaussian structural equation model (SEM). Deviating slightly from the literature, we use the term non-Gaussian to refer to the fact that the variables are not jointly Gaussian; however, we do not require non-Gaussianity of all exogenous noise variables, as assumed by Shimizu et al. (2011). We proceed in two steps, where each step is of independent interest: First, we infer the moralized graph by estimating the inverse covariance matrix of the joint distribution. The novelty is that we justify this approach for non-Gaussian linear SEMs. Second, we find the optimal causal network structure by searching over the space of DAGs that are consistent with the moralized graph and selecting the DAG that minimizes an appropriate score function. When the score function is decomposable and the moralized graph has bounded treewidth, the second step may be performed via dynamic programming in time linear in the number of nodes (Ordyniak and Szeider, 2012). Our algorithm is also applicable in a high-dimensional setting when the moralized graph is sparse, where we estimate the support of the inverse covariance matrix using a method such as the graphical Lasso (Ravikumar et al., 2011). Our algorithmic framework is summarized in Algorithm 1:
Algorithm 1: Framework for DAG estimation

Input: Data samples {x_i}_{i=1}^n from a linear SEM.
1. Obtain an estimate Θ̂ of the inverse covariance matrix (e.g., using the graphical Lasso).
2. Construct the moralized graph M̂ with edge set defined by supp(Θ̂).
3. Compute scores for DAGs that are consistent with M̂ (e.g., using the squared ℓ_2-error).
4. Find the minimal-scoring Ĝ (using dynamic programming when the score is decomposable and M̂ has bounded treewidth).
Output: Estimated DAG Ĝ.
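To make the pipeline concrete, here is a minimal end-to-end sketch in Python for a small, fully observed problem; it is an illustration under our own choices, not the paper's implementation. Step 1 uses the simple low-dimensional estimator Θ̂ = Σ̂^{-1} followed by thresholding, and step 2 replaces dynamic programming with a brute-force search over topological orderings restricted to the estimated moralized graph. The edge weights, threshold level, and Laplace noise are arbitrary choices for the example.

# Minimal sketch of Algorithm 1 for a small, fully observed problem (illustrative only).
import itertools
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 5000

# True SEM: X = B0^T X + eps with strictly upper-triangular B0.
B0 = np.zeros((p, p))
B0[0, 2], B0[1, 2], B0[2, 3] = 0.8, -0.6, 0.7
Omega0 = np.diag([1.0, 1.0, 1.0, 1.0])                           # homoscedastic errors
eps = rng.laplace(size=(n, p)) * np.sqrt(np.diag(Omega0) / 2.0)  # non-Gaussian noise
X = eps @ np.linalg.inv(np.eye(p) - B0)                          # rows satisfy x^T = eps^T (I - B0)^{-1}

# Step 1: moralized graph from the support of the inverse covariance matrix.
Sigma_hat = X.T @ X / n
Theta_hat = np.linalg.inv(Sigma_hat)
thresh = 0.1
moral = (np.abs(Theta_hat) > thresh) & ~np.eye(p, dtype=bool)

# Step 2: score orderings of DAGs consistent with the moralized graph (brute force).
def score_ordering(order):
    # Sum of residual variances when each node is regressed on its preceding
    # neighbors in the moralized graph (unweighted squared error).
    total = 0.0
    for i, j in enumerate(order):
        S = [k for k in order[:i] if moral[j, k]]
        if S:
            beta = np.linalg.lstsq(X[:, S], X[:, j], rcond=None)[0]
            resid = X[:, j] - X[:, S] @ beta
        else:
            resid = X[:, j]
        total += np.mean(resid ** 2)
    return total

best = min(itertools.permutations(range(p)), key=score_ordering)
print("best ordering:", best, "score:", round(score_ordering(best), 3))

For larger graphs, the brute-force search over orderings would be replaced by the dynamic program of Section 6.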
We prove the correctness of our graph estimation algorithm by deriving new results about the theory of linear SEMs. We present a novel result showing that for almost every choice of linear coefficients, the support of the inverse covariance matrix of the joint distribution is identical to the edge structure of the moralized graph. Although a similar relationship between the support of the inverse covariance matrix and the edge structure of an undirected conditional independence graph has long been established for multivariate Gaussian models (Lauritzen, 1996), our result does not involve any assumptions of Gaussianity, and the proof technique is completely new.

Since we do not impose constraints on the error distribution of our SEM, standard parametric maximum likelihood methods are not applicable to score and compare candidate DAGs. Consequently, we use the squared ℓ_2-error to score DAGs, and prove that in the case of homoscedastic errors, the true DAG uniquely minimizes this score function. As a side corollary, we establish that the DAG structure of a linear SEM is identifiable whenever the additive errors are homoscedastic, which generalizes a recent result derived only for Gaussian variables (Peters and Bühlmann, 2013). In addition, our result covers cases with both Gaussian and non-Gaussian errors, whereas Shimizu et al. (2011) require all errors to be non-Gaussian (see Section 4.2). A similar result is implicitly contained under some assumptions in van de Geer and Bühlmann (2013), but we provide a more general statement and additionally quantify a regime where the errors may exhibit a certain degree of heteroscedasticity. Thus, when errors are not too heteroscedastic, the much more complicated ICA algorithm (Shimizu et al., 2006, 2011) may be replaced by a simple scoring method using the squared ℓ_2-loss.

On the statistical side, we show that our method produces consistent estimates of the true DAG by invoking results from high-dimensional statistics. We note that our theoretical results only require a condition on the gap between squared ℓ_2-scores for various DAGs in the restricted search space and eigenvalue conditions on the true covariance matrix, which is much weaker than the restrictive beta-min condition from previous work (van de Geer and Bühlmann, 2013). Furthermore, the size of the gap is not required to scale linearly with the number of nodes in the graph, unlike similar conditions in van de Geer and Bühlmann (2013) and Peters and Bühlmann (2013), leading to genuinely high-dimensional results. Although the precise size of the gap relies heavily on the structure of the true DAG, we include several examples giving intuition for when our condition could be expected to hold (see Sections 4.4 and 5.2 below). Finally, since inverse covariance matrix estimation and computing scores based on linear regression are both easily modified to deal with systematically corrupted data (Loh and Wainwright, 2012), we show that our methods are also applicable for learning the DAG structure of a linear SEM when data are observed subject to corruptions such as missing data and additive noise.

The remainder of the paper is organized as follows: In Section 2, we review the general theory of probabilistic graphical models and linear SEMs. Section 3 describes our results on the relationship between the inverse covariance matrix and the conditional independence graph of a linear SEM. In Section 4, we discuss the use of the squared ℓ_2-loss for scoring candidate DAGs.
Section 5 establishes results for the statistical consistency of our proposed inference algorithms and explores the gap condition for various graphs. Finally, Section 6 describes how dynamic programming may be used to identify the optimal DAG in linear time when the moralized graph has bounded treewidth. Proofs of supporting results are contained in the Appendix.
2. Background
We begin by reviewing some basic background material and introducing notation for the graph estimation problems studied in this paper.
In this section, we briefly review the theory of directed and undirected graphical models, the latter also known as conditional independence graphs (CIGs). For a more in-depth exposition, see Lauritzen (1996) or Koller and Friedman (2009) and the references cited therein.
Consider a probability distribution q(x_1, …, x_p) and an undirected graph G = (V, E), where V = {1, …, p} and E ⊆ V × V. We say that G is a conditional independence graph (CIG) for q if the following Markov condition holds: For all disjoint triples (A, B, S) ⊆ V such that S separates A from B in G, we have X_A ⊥⊥ X_B | X_S. Here, X_C := {X_j : j ∈ C} for any subset C ⊆ V. We also say that G represents the distribution q.

By the well-known Hammersley-Clifford theorem, if q is a strictly positive distribution (i.e., q(x_1, …, x_p) > 0 for all (x_1, …, x_p)), then G is a CIG for q if and only if we may write

q(x_1, …, x_p) = ∏_{C ∈ 𝒞} ψ_C(x_C),   (1)

for some potential functions {ψ_C : C ∈ 𝒞} defined over the set of cliques 𝒞 of G. In particular, note that the complete graph on p nodes is always a CIG for q, but CIGs with fewer edges often exist.

Changing notation slightly, consider a directed graph G = (V, E), where we now distinguish between edges (j, k) and (k, j). We say that G is a directed acyclic graph (DAG) if there are no directed paths starting and ending at the same node. For each node j ∈ V, let Pa(j) := {k ∈ V : (k, j) ∈ E} denote the parent set of j, where we sometimes write Pa_G(j) to emphasize the dependence on G. A DAG G represents a distribution q(x_1, …, x_p) if q factorizes as

q(x_1, …, x_p) ∝ ∏_{j=1}^p q(x_j | x_{Pa(j)}).   (2)

Finally, a permutation π of the vertex set V = {1, …, p} is a topological order for G if π(j) < π(k) whenever (j, k) ∈ E. Such a topological order exists for any DAG, but it may not be unique. The factorization (2) implies that X_j ⊥⊥ X_{ν(j)} | X_{Pa(j)} for all j, where ν(j) denotes the set of all nondescendants of j, excluding its parents.

Given a DAG G, we may form the moralized graph M(G) by fully connecting all nodes within each parent set Pa(j) and dropping the orientations of directed edges. Note that moralization is a purely graph-theoretic operation that transforms a directed graph into an undirected graph. However, if the DAG G represents a distribution q, then M(G) is also a CIG for q. This is because each set {j} ∪ Pa(j) forms a clique C_j in M(G), and we may define the potential functions ψ_{C_j}(x_{C_j}) := q(x_j | x_{Pa(j)}) to obtain the factorization (1) from the factorization (2).

Finally, we define the skeleton of a DAG G to be the undirected graph formed by dropping the orientations of edges in G. Note that the edge set of the skeleton is a subset of the edge set of the moralized graph, but the latter set is generally much larger. The skeleton is not in general a CIG.

We now specialize to the framework of linear structural equation models. We say that a random vector X = (X_1, …, X_p) ∈ R^p follows a linear structural equation model (SEM) if

X = B^T X + ε,   (3)

where B is a strictly upper triangular matrix known as the autoregression matrix. We assume E[X] = E[ε] = 0 and ε_j ⊥⊥ (X_1, …, X_{j−1}) for all j.

In particular, observe that the DAG G with vertex set V = {1, …, p} and edge set E = {(j, k) : B_{jk} ≠ 0} represents the joint distribution q on X. Indeed, equation (3) implies that

q(X_j | X_1, …, X_{j−1}) = q(X_j | X_{Pa_G(j)}),

so we may factorize

q(X_1, …, X_p) = ∏_{j=1}^p q(X_j | X_1, …, X_{j−1}) = ∏_{j=1}^p q(X_j | X_{Pa_G(j)}).

Given samples {x_i}_{i=1}^n, our goal is to infer the unknown matrix B, from which we may recover G (or vice versa).
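Moralization itself is a purely combinatorial operation. The short Python sketch below (with a hypothetical example DAG chosen by us) makes the construction explicit: the parents of each node are married and the orientations are dropped.

# Moralization of a DAG: marry parents of each node, then drop orientations.
from itertools import combinations

def moralize(p, dag_edges):
    # Return the edge set of the moralized graph M(G) as a set of frozensets.
    parents = {j: set() for j in range(p)}
    for (j, k) in dag_edges:              # directed edge j -> k
        parents[k].add(j)
    undirected = {frozenset(e) for e in dag_edges}
    for k in range(p):                    # connect all pairs of parents of k
        for a, b in combinations(parents[k], 2):
            undirected.add(frozenset((a, b)))
    return undirected

# v-structure 0 -> 2 <- 1, plus the edge 2 -> 3:
print(sorted(tuple(sorted(e)) for e in moralize(4, [(0, 2), (1, 2), (2, 3)])))
# The moral edge {0, 1} appears even though (0, 1) is not an edge of the DAG.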
3. Moralized graphs and inverse covariance matrices
In this section, we describe our main result concerning inverse covariance matrices of linear SEMs. It generalizes a result for multivariate Gaussians, and states that the inverse covariance matrix of the joint distribution of a linear SEM reflects the structure of a conditional independence graph.

We begin by noting that E[X_j | X_1, …, X_{j−1}] = b_j^T X, where b_j is the j-th column of B, and

b_j = ( Σ_{j, 1:j−1} (Σ_{1:j−1, 1:j−1})^{−1}, 0, …, 0 )^T.

Here, Σ := cov[X]. We call b_j^T X the best linear predictor for X_j among linear combinations of {X_1, …, X_{j−1}}. Defining Ω := cov[ε] and Θ := Σ^{−1}, we then have the following lemma, proved in Appendix C.1:

Lemma 1
The matrix of error covariances is diagonal:
Ω = diag(σ_1^2, …, σ_p^2) for some σ_j^2 > 0. The entries of Θ are given by

Θ_{jk} = −σ_k^{−2} B_{jk} + ∑_{ℓ>k} σ_ℓ^{−2} B_{jℓ} B_{kℓ},   for all j < k,   (4)

Θ_{jj} = σ_j^{−2} + ∑_{ℓ>j} σ_ℓ^{−2} B_{jℓ}^2,   for all j.   (5)

In particular, equation (4) has an important implication for causal inference, which we state in the following theorem. Recalling the notation of Section 2.1.2, the graph M(G) denotes the moralized DAG.

Theorem 2
Suppose X is generated from the linear structural equation model (3). Then Θ reflects the graph structure of the moralized DAG; i.e., for j ≠ k, we have Θ_{jk} = 0 if (j, k) is not an edge in M(G).

Proof
Suppose j ≠ k and (j, k) is not an edge in M(G), and assume without loss of generality that j < k. Certainly, (j, k) ∉ E, implying that B_{jk} = 0. Furthermore, j and k cannot share a common child, or else (j, k) would be an edge in M(G). This implies that either B_{jℓ} = 0 or B_{kℓ} = 0 for each ℓ > k. The desired result then follows from equation (4).

In the results to follow, we will assume that the converse of Theorem 2 holds as well. This is stated in the following assumption:

Assumption 1
Let (B, Ω) be the matrices of the underlying linear SEM. For every j < k, we have

−σ_k^{−2} B_{jk} + ∑_{ℓ>k} σ_ℓ^{−2} B_{jℓ} B_{kℓ} = 0

only if B_{jk} = 0 and B_{jℓ} B_{kℓ} = 0 for all ℓ > k.

Combined with Theorem 2, Assumption 1 implies that Θ_{jk} = 0 if and only if (j, k) is not an edge in M(G). (Since Θ ≻ 0, the diagonal entries of Θ are always strictly positive.) Note that when the nonzero entries of B are drawn independently from continuous distributions, Assumption 1 holds for all choices of B except on a set of Lebesgue measure zero.
Remark 3

Note that Theorem 2 may be viewed as an extension of the canonical result for Gaussian graphical models. Indeed, a multivariate Gaussian distribution may be written as a linear SEM with respect to any permutation order π of the variables, giving rise to a DAG G_π. Theorem 2 states that supp(Θ) is always a subset of the edge set of M(G_π). Assumption 1 is a type of faithfulness assumption (Koller and Friedman, 2009; Spirtes et al., 2000). By selecting different topological orders π, one may then derive the usual result that X_j ⊥⊥ X_k | X_{V∖{j,k}} if and only if Θ_{jk} = 0 in the Gaussian setting. Note that this conditional independence assertion may not always hold for linear SEMs, however, since non-Gaussian distributions are not necessarily expressible as a linear SEM with respect to an arbitrary permutation order. Indeed, we only require Assumption 1 to hold with respect to a single (fixed) order.
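The following small numerical check (ours, not taken from the paper) illustrates Lemma 1 and Theorem 2: for a chosen strictly upper-triangular B and diagonal Ω, the matrix Θ = (I − B) Ω^{−1} (I − B)^T has support contained in the edge set of M(G), including the moral edge between parents that share a child. The particular values of B and Ω are arbitrary.

# Numerical check of Lemma 1 / Theorem 2 for a v-structure 0 -> 2 <- 1 plus 2 -> 3.
import numpy as np

p = 4
B = np.zeros((p, p))
B[0, 2], B[1, 2], B[2, 3] = 1.5, -2.0, 0.5
Omega = np.diag([1.0, 2.0, 0.5, 1.0])

Sigma = np.linalg.inv(np.eye(p) - B).T @ Omega @ np.linalg.inv(np.eye(p) - B)
Theta = np.linalg.inv(Sigma)

print(np.round(Theta, 3))
# Theta[0, 1] != 0 (moral edge from the common child 2), while Theta[0, 3] = Theta[1, 3] = 0,
# in agreement with equation (4): Theta_jk = -sigma_k^{-2} B_jk + sum_{l>k} sigma_l^{-2} B_jl B_kl.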
4. Score functions for DAGs
Having established a method for reducing the search space of DAGs based on estimating the moralized graph, we now move to the more general problem of scoring candidate DAGs. As before, we assume the setting of a linear SEM.

Parametric maximum likelihood is often used as a score function for statistical estimation of DAG structure, since it enjoys the nice property that the population-level version is maximized only under a correct parametrization of the model class. This follows from the relationship between maximum likelihood and KL divergence:

arg max_θ E_{θ_0}[log p_θ(X)] = arg min_θ E_{θ_0}[ log( p_{θ_0}(X) / p_θ(X) ) ] = arg min_θ D_KL( p_{θ_0} ‖ p_θ ),

and the latter quantity is minimized exactly when p_θ ≡ p_{θ_0}, almost everywhere. If the model is identifiable, this happens only if θ = θ_0.

However, such maximum likelihood methods presuppose a fixed parametrization for the model class. In the case of linear SEMs, this translates into an appropriate parametrization of the error vector ε. For comparison, note that minimizing the squared ℓ_2-error for ordinary linear regression may be viewed as a maximum likelihood approach when errors are Gaussian, but the ℓ_2-minimizer is still statistically consistent for estimation of the regression vector when errors are not Gaussian. When our goal is recovery of the autoregression matrix B of the DAG, it is therefore natural to ask whether the squared ℓ_2-error could be used in place of maximum likelihood as an appropriate metric for evaluating DAGs.

We will show that in settings where the noise variances {σ_j^2}_{j=1}^p are specified up to a constant (e.g., homoscedastic errors), the answer is affirmative. In such cases, the true DAG uniquely minimizes the ℓ_2-loss. As a side result, we also show that the true linear SEM is identifiable.

Remark 4
Nowzohour and Bühlmann (2013) study the use of nonparametric maximum likelihood methods for scoring candidate DAGs. We remark that such methods could also be combined with the framework of Sections 3 and 6 to select the optimal DAG for linear SEMs with nonparametric error distributions: First, estimate the moralized graph via the inverse covariance matrix, and then find the DAG with minimal score using a method such as dynamic programming. Similar statistical guarantees would hold in that case, with parametric rates replaced by nonparametric rates. However, our results in this section imply that in settings where the error variances are known or may be estimated accurately, the much simpler method of squared ℓ_2-loss may be used in place of a more complicated nonparametric approach.

Suppose X is drawn from a linear SEM (3), where we now use B_0 to denote the true autoregression matrix and Ω_0 to denote the true error covariance matrix. For a fixed diagonal matrix Ω = diag(σ_1^2, …, σ_p^2) and a candidate matrix B with columns {b_j}_{j=1}^p, define the score of B with respect to Ω according to

score_Ω(B) = E[ ‖Ω^{−1/2} (I − B)^T X‖_2^2 ] = ∑_{j=1}^p (1/σ_j^2) · E[(X_j − b_j^T X)^2].   (6)

This is a weighted squared ℓ_2-loss, where the prediction error for the j-th coordinate is weighted by the diagonal entry σ_j^2 coming from Ω, and expectations are taken with respect to the true distribution on X.

It is instructive to compare the score function (6) to the usual parametric maximum likelihood when X ∼ N(0, Σ). For a distribution parametrized by the pair (B, Ω), the inverse covariance matrix is Θ = (I − B) Ω^{−1} (I − B)^T, so the expected log likelihood is

E_{X ∼ N(0,Σ)}[log p_{B,Ω}(X)] = −(1/2) tr[(I − B) Ω^{−1} (I − B)^T Σ] + (1/2) log det[(I − B) Ω^{−1} (I − B)^T] + c
 = −(1/2) tr[(I − B) Ω^{−1} (I − B)^T Σ] + (1/2) log det(Ω^{−1}) + c
 = −(1/2) score_Ω(B) + (1/2) log det(Ω^{−1}) + c,

where c is a constant not depending on (B, Ω). Hence, minimizing the score over B for a fixed Ω is identical to maximizing the likelihood. For non-Gaussian distributions, however, the convenient relationship between minimum score and maximum likelihood no longer holds.

Now let D denote the class of DAGs. For G ∈ D, define the score of G to be

score_Ω(G) := min_{B ∈ U_G} { score_Ω(B) },   (7)

where

U_G := { B ∈ R^{p×p} : B_{jk} = 0 when (j, k) ∉ E(G) }

is the set of matrices that are consistent with the structure of G.

Remark 5
Examining the form of the score function (6), we see that if {Pa_G(j)}_{j=1}^p denotes the parent sets of nodes in G, then the matrix

B_G := arg min_{B ∈ U_G} { score_Ω(B) }

is unique, and the columns of B_G are equal to the coefficients of the best linear predictor of X_j regressed upon X_{Pa_G(j)}. Furthermore, the value of B_G does not depend on Ω.

The following lemma relates the score of the underlying DAG G_0 to the score of the true autoregression matrix B_0. In fact, any DAG containing G_0 has the same score. The proof is contained in Appendix C.2.

Lemma 6
Suppose X follows a linear SEM with autoregression matrix B_0, and let G_0 denote the underlying DAG. Consider any G ∈ D such that G_0 ⊆ G. Then for any diagonal weight matrix Ω, we have score_Ω(G) = score_Ω(B_0), and B_0 is the unique minimizer of score_Ω(B) over U_G.

We now turn to the main theorem of this section, in which we consider the problem of minimizing score_Ω(B) with respect to all matrices B that are permutation similar to upper triangular matrices. Such a result is needed to validate our choice of score function, since when the DAG structure is not known a priori, the space of possible autoregression matrices must include all of U := ∪_{G ∈ D} U_G. Note that U may be equivalently defined as the set of all matrices that are permutation similar to upper triangular matrices. We have the following vital result:

Theorem 7
Given a linear SEM (3) with error covariance matrix αΩ_0 and autoregression matrix B_0, where α > 0, we have

score_{αΩ_0}(B) ≥ score_{αΩ_0}(B_0) = p/α,   for all B ∈ U,   (8)

with equality if and only if B = B_0.

The proof of Theorem 7, which is based on matrix algebra, is contained in Section 4.5. In particular, Theorem 7 implies that the squared ℓ_2-loss function (6) is indeed an appropriate measure of model fit when the components are correctly weighted by the diagonal entries of Ω_0.

Note, however, that Theorem 7 requires the score to be taken with respect to (a multiple of) the true error covariance matrix Ω_0. The following example gives a cautionary message: if the weights Ω are chosen incorrectly, minimizing score_Ω(B) may produce a structure that is inconsistent with the true model.

Example 1
Suppose (X_1, X_2) is distributed according to the following linear SEM: X_1 = ε_1 and X_2 = −(1/2) X_1 + ε_2, so the autoregression matrix is given by

B_0 = ( 0  −1/2 ; 0  0 ).

Let Ω_0 = diag(1, 1/4). Consider

B_1 = ( 0  0 ; −1  0 ).

Then

score_I(B_1) < score_I(B_0),   (9)

so using the squared ℓ_2-loss weighted by the identity will select an inappropriate model.

Proof
To verify equation (9), we first compute

Σ = (I − B_0)^{−T} Ω_0 (I − B_0)^{−1} = ( 1  −1/2 ; −1/2  1/2 ).

Then

E[‖X − B_1^T X‖_2^2] = tr[(I − B_1)^T Σ (I − B_1)] = tr( 1/2  0 ; 0  1/2 ) = 1,
E[‖X − B_0^T X‖_2^2] = tr[(I − B_0)^T Σ (I − B_0)] = tr( 1  0 ; 0  1/4 ) = 5/4,

implying inequality (9).
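For completeness, the computation above can be verified numerically using the matrices displayed in Example 1; this quick Python check is ours and not part of the original text.

# Numerical check of Example 1: with identity weights, the misspecified B1 scores below B0.
import numpy as np

B0 = np.array([[0.0, -0.5], [0.0, 0.0]])
B1 = np.array([[0.0, 0.0], [-1.0, 0.0]])
Omega0 = np.diag([1.0, 0.25])
I = np.eye(2)

Sigma = np.linalg.inv(I - B0).T @ Omega0 @ np.linalg.inv(I - B0)

def score_identity(B):
    # score_I(B) = E ||X - B^T X||_2^2 = tr[(I - B)^T Sigma (I - B)]
    return np.trace((I - B).T @ Sigma @ (I - B))

print(score_identity(B1), "<", score_identity(B0))   # 1.0 < 1.25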
Theorem 7 also has a useful consequence in terms of identifiability of a linear SEM, which we state in the following corollary:

Corollary 8
Consider a fixed diagonal covariance matrix Ω_0, and consider the class of linear SEMs parametrized by the pair (B, αΩ_0), where B ∈ U and α > 0 is a scale factor. Then the true model (B_0, α_0 Ω_0) is identifiable. In particular, the class of homoscedastic linear SEMs is identifiable.

Proof
By Theorem 7, the matrix B_0 is the unique minimizer of score_{Ω_0}(B). Since α_0 · (Ω_0)_{11} = var[X_1], the scale factor α_0 is also uniquely identifiable. The statement about homoscedasticity follows by taking Ω_0 = I.

Corollary 8 should be viewed in comparison to previous results in the literature regarding identifiability of linear SEMs. Theorem 1 of Peters and Bühlmann (2013) states that when X is Gaussian and ε is an i.i.d. Gaussian vector with cov[ε] = αΩ_0, the model is identifiable. Indeed, our Corollary 8 implies that result as a special case, but it does not impose any additional conditions concerning Gaussianity. Shimizu et al. (2006) establish identifiability of a linear SEM when ε is a vector of independent, non-Gaussian errors, by reducing to ICA, but our result does not require errors to be non-Gaussian.

The significance of Corollary 8 is that it supplies an elegant proof showing that the model is still identifiable even in the presence of both Gaussian and non-Gaussian components, provided the error variances are specified up to a scalar multiple. Since any multivariate Gaussian distribution may be written as a linear SEM with respect to an arbitrary ordering, some constraint such as variance scaling or non-Gaussianity is necessary in order to guarantee identifiability.

Theorem 7 implies that when the diagonal variances of Ω_0 are known up to a scalar factor, the weighted ℓ_2-loss (6) may be used as a score function for linear SEMs. Example 1 shows that when Ω is misspecified, we may have B_0 ∉ arg min_{B ∈ U} {score_Ω(B)}. In this section, we further study the effect when Ω is misspecified. Intuitively, provided Ω is close enough to Ω_0 (or a multiple thereof), minimizing score_Ω(B) with respect to B should still yield the correct B_0.

Consider an arbitrary diagonal weight matrix Ω. We first provide bounds on the ratio between entries of Ω and Ω_0 which ensure that B_0 = arg min_{B ∈ U} {score_Ω(B)}, even though the model is misspecified. Let

a_max := λ_max(Ω_0 Ω^{−1}),   and   a_min := λ_min(Ω_0 Ω^{−1}),

denote the maximum and minimum ratios between corresponding diagonal entries of Ω_0 and Ω. Now define the additive gap between the score of G_0 and the next best DAG, given by

ξ := min_{G ∈ D, G ⊉ G_0} { score_{Ω_0}(G) − score_{Ω_0}(G_0) } = min_{G ∈ D, G ⊉ G_0} { score_{Ω_0}(G) } − p.   (10)

By Theorem 7, we know that ξ > 0. The following theorem provides a sufficient condition for correct model selection in terms of the gap ξ and the ratio a_max/a_min, which are both invariant to the scale factor α. It is a measure of robustness for how roughly the entries of Ω_0 may be approximated and still produce B_0 as the unique minimizer. The proof of the theorem is contained in Appendix C.3.

Theorem 9
Suppose

a_max / a_min ≤ 1 + ξ/p.   (11)

Then B_0 ∈ arg min_{B ∈ U} {score_Ω(B)}. If inequality (11) is strict, then B_0 is the unique minimizer of score_Ω(B).

Remark 10
Theorem 9 provides an error allowance concerning the accuracy to which we may specify the error covariances and still recover the correct autoregression matrix B_0 from an improperly weighted score function. In the case Ω = αΩ_0, we have a_max = a_min, so the condition (11) is always strictly satisfied, which is consistent with our earlier result in Theorem 7 that B_0 = arg min_{B ∈ U} {score_{αΩ_0}(B)}.

Naturally, the error tolerance specified by Theorem 9 is a function of the gap ξ between the true DAG G_0 and the next best candidate: If ξ is larger, the score is more robust to misspecification of the weights Ω. Note that if we restrict our search space from the full set of DAGs D to some smaller space D′, so B ∈ ∪_{G ∈ D′} U_G, we may restate the condition in Theorem 9 in terms of the gap

ξ(D′) := min_{G ∈ D′, G ⊉ G_0} { score_{Ω_0}(G) − score_{Ω_0}(G_0) },   (12)

which may be considerably larger than ξ when D′ is much smaller than D (cf. equation (21) and Section 5.2 on weakening the gap condition below).

Specializing to the case Ω = I, we may interpret Theorem 9 as providing a window of variances within which we may treat a heteroscedastic model as homoscedastic and use the simple (unweighted) squared ℓ_2-score to recover the correct model. See Lemma 11 in the next section for a concrete example.

In this section, we develop examples illustrating the gap ξ introduced in Section 4.3. We study two cases, involving two and three variables, respectively.
Figure 1: Two-variable DAG. (a) The forward model. (b) The backward model.
We first consider the simplest example of a two-variable directed graph. Suppose the forward model is defined by

B_0 = ( 0  b_0 ; 0  0 ),   Ω_0 = ( d_1^2  0 ; 0  d_2^2 ),

and consider the backward model defined by the autoregression matrix

B = ( 0  0 ; b  0 ).   (13)

The forward and backward models are illustrated in Figure 1. A straightforward calculation shows that

score_{Ω_0}(B) = 2 + b^2 b_0^2 + ( b_0 d_1/d_2 − b d_2/d_1 )^2,

which is minimized for b = b_0 / (b_0^2 + d_2^2/d_1^2), implying that

ξ = min_b { score_{Ω_0}(B) − score_{Ω_0}(B_0) } = b_0^4 d_1^4 / (d_2^4 + b_0^2 d_1^2 d_2^2).   (14)

We see that the gap ξ grows with the strength of the true edge b_0 when |b_0| > 1, and is symmetric with respect to the sign of b_0. The gap also grows with the magnitude of the ratio d_1/d_2.

To gain intuition for the interplay between b_0 and d_1/d_2, we derive the following lemma, a corollary of Theorem 9 specialized to the case of the two-variable DAG:

Lemma 11
Consider the two-variable DAG defined by equation (13). Let Ω = I and define r := d_2^2 / d_1^2. Suppose the following condition holds:

b_0^2 ≥ r( (r − 1) + √(r^2 − 1) ),   if r ≥ 1,
b_0^2 ≥ (1 − r) + √(1 − r^2),   if r ≤ 1.   (15)

Then B_0 = arg min_{B ∈ U} {score_Ω(B)}; i.e., B_0 is the unique minimizer of the score function under the unweighted squared-ℓ_2 loss.

Lemma 11 is proved in Appendix C.4.
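A quick numerical sanity check of equation (14), not taken from the paper, is to minimize score_{Ω_0}(B) over the backward coefficient b on a grid and compare against the closed-form gap; the particular values of b_0, d_1, d_2 below are arbitrary.

# Sanity check of the two-variable gap formula (14) by grid minimization over b.
import numpy as np

b0, d1, d2 = 1.3, 1.0, 0.7
Sigma = np.array([[d1**2, b0 * d1**2],
                  [b0 * d1**2, b0**2 * d1**2 + d2**2]])

def backward_score(b):
    # score_{Omega_0}(B) for the backward model (X_1 regressed on X_2), weights 1/d_1^2, 1/d_2^2.
    e1 = Sigma[0, 0] - 2 * b * Sigma[0, 1] + b**2 * Sigma[1, 1]
    return e1 / d1**2 + Sigma[1, 1] / d2**2

grid = np.linspace(-5, 5, 200001)
xi_numeric = backward_score(grid).min() - 2.0        # score of the true DAG is p = 2
xi_formula = b0**4 * d1**4 / (d2**4 + b0**2 * d1**2 * d2**2)
print(round(xi_numeric, 4), round(xi_formula, 4))    # the two values agree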
Figure 2: Three-variable DAG with v-structure.
Remark 12

Note that the two right-hand expressions in inequality (15) are similar, although the expression in the case r ≥ 1 contains an extra factor of r, so the sufficient condition is stronger. Both lower bounds in equation (15) increase with |r − 1|, which agrees with intuition: If the true model is more non-homoscedastic, the strength of the true edge must be stronger in order for the unweighted squared-ℓ_2 score to correctly identify the model. When r = 1, we have the vacuous condition b_0^2 ≥ 0, since Ω = αΩ_0 and the variances are correctly specified, so Theorem 7 implies B_0 = arg min_{B ∈ U} {score_Ω(B)} for any choice of b_0.

We now examine a three-variable graph. Suppose the actual graph involves a v-structure, as depicted in Figure 2, and is parametrized by the matrices

B_0 = ( 0  0  b_1 ; 0  0  b_2 ; 0  0  0 ),   Ω_0 = diag(d_1^2, d_2^2, d_3^2).   (16)

We have the following lemma, proved in Appendix C.5:

Lemma 13
Consider the three-variable DAG characterized by Figure 2 and equations (16). The gap ξ defined by equation (10) is given by

ξ = min{ b_1^4 d_1^4 / (d_3^4 + b_1^2 d_1^2 d_3^2),  b_2^4 d_2^4 / (d_3^4 + b_2^2 d_2^2 d_3^2) }.

A key reduction in the proof of Lemma 13 is to note that we only need to consider a relatively small number of DAGs, given by Figure 3 in Appendix C.5. Indeed, for G_1 ⊆ G_2, we have score_{Ω_0}(G_2) ≤ score_{Ω_0}(G_1), so it suffices to consider maximal elements in the poset of DAGs not containing the true DAG G_0.

Remark 14
Note that the form of the gap in Lemma 13 is very similar to the form for the two-variable model, and the individual ratios scale with the strength of the edge and the ratio of the corresponding error variances. Indeed, we could derive a version of Lemma 11 for the three-variable model, giving lower bounds on the edge strengths b_1 and b_2 that guarantee the accuracy of the unweighted squared ℓ_2-loss; however, the conditions would be more complicated. It is interesting to note from our calculations in Appendix C.5 that the gap between models accumulates according to the number of edge reversals from the misspecified model: Reversing the directions of edges (2, 3) and (1, 3) in succession leads to an additional term in the expressions for ξ_1 and ξ_2 in equations (44) and (45) below. We will revisit these observations in Section 5.2, where we define a version of the gap function rescaled by the number of nodes that differ in their parent sets.

Proof of Theorem 7

First note that for a constant α > 0, we have

score_{αΩ}(B) = (1/α) · score_Ω(B),

so minimizing score_{αΩ}(B) is equivalent to minimizing score_Ω(B). Hence, it is sufficient to prove the statement for α = 1.

Recalling equation (40), we may write

score_{Ω_0}(B) = E_{B_0}[ ‖Ω_0^{−1/2} (I − B)^T X‖_2^2 ]
 = tr[ Ω_0^{−1/2} (I − B)^T · cov_{B_0}[X] · (I − B) Ω_0^{−1/2} ]
 = tr[ Ω_0^{−1/2} (I − B)^T · (I − B_0)^{−T} Ω_0 (I − B_0)^{−1} · (I − B) Ω_0^{−1/2} ].

Now note that

(I − B) Ω_0^{−1/2} = Ω_0^{−1/2} (I − Ω_0^{1/2} B Ω_0^{−1/2}) := Ω_0^{−1/2} (I − B̃),
Ω_0^{1/2} (I − B_0)^{−1} = (I − Ω_0^{1/2} B_0 Ω_0^{−1/2})^{−1} Ω_0^{1/2} := (I − B̃_0)^{−1} Ω_0^{1/2},

where B̃, B̃_0 ∈ U. Hence, we may rewrite

score_{Ω_0}(B) = tr[ (I − B̃)^T Ω_0^{−1/2} · Ω_0^{1/2} (I − B̃_0)^{−T} · (I − B̃_0)^{−1} Ω_0^{1/2} · Ω_0^{−1/2} (I − B̃) ]
 = tr[ (I − B̃)^T (I − B̃_0)^{−T} (I − B̃_0)^{−1} (I − B̃) ]
 = tr[ (I − B̃)(I − B̃)^T (I − B̃_0)^{−T} (I − B̃_0)^{−1} ].

Since B̃, B̃_0 ∈ U, the matrices I − B̃ and I − B̃_0 are both permutation similar to lower triangular matrices with 1's on the diagonal. Hence, Lemma 27 in Appendix B implies

score_{Ω_0}(B) ≥ p,

with equality if and only if I − B̃ = I − B̃_0, or equivalently, B = B_0, as claimed.
5. Consequences for statistical estimation
The population-level results in Theorems 2 and 7 provide a natural avenue for estimating the DAG of a linear SEM from data. In this section, we outline how the true DAG may be estimated in the presence of fully observed or systematically corrupted data. Our method is also applicable in the high-dimensional setting, assuming the moralized DAG is sufficiently sparse. Our inference algorithm consists of two main components:
1. Estimate the moralized DAG M(G_0) using the inverse covariance matrix of X.
2. Search through the space of DAGs consistent with M(G_0), and find the DAG that minimizes score_Ω(B).

Theorem 2 and Assumption 1 ensure that for almost every choice of autoregression matrix B_0, the support of the true inverse covariance matrix Θ_0 exactly corresponds to the edge set of the moralized graph. Theorem 7 ensures that when the weight matrix Ω is chosen appropriately, B_0 will be the unique minimizer of score_Ω(B).

We now present concrete statistical guarantees for the correctness of our algorithm in the usual setting when {x_i}_{i=1}^n are fully observed and i.i.d. Recall that a random variable X is sub-Gaussian with parameter σ^2 if

E[exp(λ(X − E[X]))] ≤ exp(σ^2 λ^2 / 2),   for all λ ∈ R.

If X ∈ R^p is a random vector, it is sub-Gaussian with parameter σ^2 if v^T X is a sub-Gaussian random variable with parameter σ^2 for all unit vectors v ∈ R^p.

We first consider the problem of inferring Θ_0. Let

Θ_min^0 := min_{j,k} { |(Θ_0)_{jk}| : (Θ_0)_{jk} ≠ 0 }

denote the magnitude of the minimum nonzero element of Θ_0. We consider the following two scenarios:

Low-dimensional setting. If n ≥ p, the sample covariance matrix Σ̂ = (1/n) ∑_{i=1}^n x_i x_i^T is invertible, and we use the estimator Θ̂ = (Σ̂)^{−1}. We have the following lemma, which follows from standard bounds in random matrix theory:
Lemma 15
Suppose the x_i's are i.i.d. sub-Gaussian vectors with parameter σ^2. With probability at least 1 − c_1 exp(−c_2 p), we have

‖Θ̂ − Θ_0‖_max ≤ c_0 σ^2 √(p/n),

and thresholding Θ̂ at level τ = c_0 σ^2 √(p/n) succeeds in recovering supp(Θ_0), provided Θ_min^0 > 2τ.

For the proof, see Appendix D.1. Here, we use ‖·‖_max to denote the elementwise ℓ_∞-norm of a matrix.

High-dimensional setting. If p > n, we assume each row of the true inverse covariance matrix Θ_0 is d-sparse. Then we use the graphical Lasso:

Θ̂ ∈ arg min_{Θ ⪰ 0} { tr(Θ Σ̂) − log det(Θ) + λ ∑_{j≠k} |Θ_{jk}| }.   (17)

Standard results (Ravikumar et al., 2011) establish the statistical consistency of the graphical Lasso (17) as an estimator for the inverse covariance matrix in the setting of sub-Gaussian observations; consequently, we omit the proof of the following lemma.
Lemma 16

Suppose the x_i's are i.i.d. sub-Gaussian vectors with parameter σ^2, and suppose the sample size satisfies n ≥ Cd^2 log p. With probability at least 1 − c_1 exp(−c_2 log p), we have

‖Θ̂ − Θ_0‖_max ≤ c_0 σ^2 √(log p / n),

and thresholding Θ̂ at level τ = c_0 σ^2 √(log p / n) succeeds in recovering supp(Θ_0), provided Θ_min^0 > 2τ.

Alternatively, we may perform nodewise regression with the ordinary Lasso (Meinshausen and Bühlmann, 2006) to recover the support of Θ_0, with similar rates for statistical consistency.

Moving on to the second step of the algorithm, we need to estimate the score functions score_Ω(B) of candidate DAGs and choose the minimally scoring candidate. In this section, we focus on methods for estimating an empirical version of the score function and derive rates for statistical estimation under certain models. If the space of candidate DAGs is sufficiently small, we may evaluate the empirical score function for every candidate DAG and select the optimum. In Section 6, we describe computationally efficient procedures based on dynamic programming to choose the optimal DAG when the candidate space is too large for naive search.

The input of our algorithm is the sparsely estimated inverse covariance matrix Θ̂ from Section 5.1.1. For a matrix Θ, define the candidate neighborhood sets

N_Θ(j) := { k : k ≠ j and Θ_{jk} ≠ 0 },   for all j,

and let

D_Θ := { G ∈ D : Pa_G(j) ⊆ N_Θ(j) for all j }

denote the set of DAGs with skeleton contained in the graph defined by supp(Θ). By Theorem 2 and Assumption 1, we have G_0 ∈ D_{Θ_0}, so if supp(Θ̂) ⊇ supp(Θ_0), which occurs with high probability under the conditions of Section 5.1.1, it suffices to search over the reduced DAG space D_{Θ̂}.
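As an illustration of how the estimate Θ̂ translates into a reduced search space, the following Python sketch computes the candidate neighborhood sets N_{Θ̂}(j) from a thresholded precision matrix. It assumes scikit-learn is available; the GraphicalLasso estimator, the regularization level alpha, and the threshold tau are off-the-shelf choices of ours, not prescriptions from the paper.

# Sketch of the support-recovery step: graphical Lasso estimate of Theta, then thresholding.
import numpy as np
from sklearn.covariance import GraphicalLasso

def candidate_neighborhoods(X, alpha=0.05, tau=0.1):
    # Estimate Theta with the graphical Lasso, threshold it at tau, and return the
    # candidate neighborhood sets N_Theta(j) used to restrict the DAG search.
    model = GraphicalLasso(alpha=alpha).fit(X)
    Theta_hat = model.precision_
    p = Theta_hat.shape[0]
    support = (np.abs(Theta_hat) > tau) & ~np.eye(p, dtype=bool)
    return {j: set(np.flatnonzero(support[j])) for j in range(p)}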
Remark 17

In fact, we could reduce the search space even further to include only DAGs with moralized graph equal to the undirected graph defined by supp(Θ). The dynamic programming algorithm to be described in Section 6 only requires as input a superset of the skeleton; for alternative versions of the dynamic programming algorithm taking as input a superset of the moralized graph, we would indeed restrict D_Θ to DAGs with the correct moral structure.

We now consider an arbitrary d-sparse matrix Θ, with d ≤ n, and take G ∈ D_Θ. By Remark 5, we have

score_Ω(G) = ∑_{j=1}^p f_{σ_j}(Pa_G(j)),   (18)

where

f_{σ_j}(S) := (1/σ_j^2) · E[(x_j − b_j^T x_S)^2],

and b_j^T x_S is the best linear predictor for x_j regressed upon x_S. In order to estimate score_Ω(G), we use the empirical functions

f̂_{σ_j}(S) := (1/σ_j^2) · (1/n) ∑_{i=1}^n (x_{ij} − x_{i,S}^T b̂_j)^2 = (1/(σ_j^2 n)) ‖X_j − X_S b̂_j‖_2^2,   (19)

where

b̂_j := (X_S^T X_S)^{−1} X_S^T X_j

is the usual ordinary least squares solution for linear regression of X_j upon X_S. We will take S ⊆ N_Θ(j), so since |N_Θ(j)| ≤ d ≤ n, the matrix X_S^T X_S is invertible with high probability.
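The empirical score in equations (18)-(19) can be computed directly by per-node least squares; the sketch below is a straightforward implementation, where the argument names parent_sets and sigma2 are ours and the weights default to the unweighted case Ω = I.

# Empirical score (18)-(19): per-node OLS residual variances, weighted by 1/sigma_j^2.
import numpy as np

def empirical_score(X, parent_sets, sigma2=None):
    # score_hat_Omega(G) = sum_j (1/sigma_j^2) * (1/n) ||X_j - X_S b_hat_j||_2^2
    n, p = X.shape
    sigma2 = np.ones(p) if sigma2 is None else np.asarray(sigma2)
    total = 0.0
    for j in range(p):
        S = sorted(parent_sets[j])
        if S:
            b_hat = np.linalg.lstsq(X[:, S], X[:, j], rcond=None)[0]
            resid = X[:, j] - X[:, S] @ b_hat
        else:
            resid = X[:, j]
        total += np.mean(resid ** 2) / sigma2[j]
    return total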
The following lemma, proved in Appendix D.2, provides rates of convergence for the empirical score function:

Lemma 18

Suppose the x_i's are i.i.d. sub-Gaussian vectors with parameter σ^2, and suppose d ≤ n is a parameter such that |N_Θ(j)| ≤ d for all j. Then

|f̂_{σ_j}(S) − f_{σ_j}(S)| ≤ c_0 (σ^2/σ_j^2) √(log p / n),   for all j and S ⊆ N_Θ(j),   (20)

with probability at least 1 − c_1 exp(−c_2 log p).

In particular, we have the following result, proved in Appendix D.3, which provides a sufficient condition for the empirical score functions to succeed in selecting the true DAG. Here,

ξ_Ω(D_Θ) := min_{G ∈ D_Θ, G ⊉ G_0} { score_Ω(G) − score_Ω(G_0) }   (21)

is the gap between G_0 and the next best DAG in D_Θ. Note that the expression (21) is reminiscent of the expression (12) defined in Section 4.3, but we now allow Ω to be arbitrary.

Lemma 19
Suppose inequality (20) holds, and suppose

c_0 σ^2 √(log p / n) · ∑_{j=1}^p (1/σ_j^2) < ξ_Ω(D_Θ) / 2.   (22)

Then

\widehat{score}_Ω(G_0) < \widehat{score}_Ω(G),   for all G ∈ D_Θ such that G ⊉ G_0.   (23)

Remark 20
Lemma 19 does not explicitly assume that Ω is equal to Ω_0, the true matrix of error variances. However, inequality (22) can only be satisfied when ξ_Ω(D_Θ) > 0; hence, Ω should be chosen such that score_Ω(G_0) < score_Ω(G) for every G ∈ D_Θ with G ⊉ G_0. As discussed in Section 4.3, this condition holds for a wider range of choices of Ω.

Note that the conclusion (23) in Lemma 19 is not quite the same as the condition

G_0 = arg min_{G ∈ D_Θ} { \widehat{score}_Ω(G) },   (24)

which is what we would need for exact recovery by our score-minimizing algorithm. The issue is that score_Ω(G) is equal for all G ⊇ G_0; however, the empirical scores \widehat{score}_Ω(G) may differ among this class, so equation (24) may not be satisfied. However, it is easily seen from the proof of Lemma 19 that in fact,

arg min_{G ∈ D_Θ} { \widehat{score}_Ω(G) } ⊆ { G ∈ D_Θ : G ⊇ G_0 }.   (25)

By applying a thresholding procedure to the empirical score minimizer Ĝ ⊇ G_0 selected by our algorithm, we could then recover the true G_0. In other words, since Pa_{G_0}(j) ⊆ Pa_{Ĝ}(j) for each j, we could use standard sparse regression techniques to recover the parent set of each node in the true DAG.

To gain some intuition for the condition (22), consider the case when σ_j^2 = 1 for all j. Then the condition becomes

c_0 σ^2 √(log p / n) < ξ(D_Θ) / (2p).   (26)

If ξ(D_Θ) = Ω(1), which might be expected based on our calculations in Section 4.4, we require n ≥ C p^2 log p in order to guarantee statistical consistency, which is not a truly high-dimensional result. On the other hand, if ξ(D_Θ) = Ω(p), as is assumed in similar work on score-based DAG learning (van de Geer and Bühlmann, 2013; Bühlmann et al., 2013), our method is consistent provided (log p)/n → 0. In Section 5.2, we relax the condition (26) to a slightly weaker condition that is more likely to hold in settings of interest.
Motivated by our comments from the previous section, we establish a sufficient condition for statistical consistency that is slightly weaker than the condition (22), which still guarantees that equation (25) holds.

For two DAGs G, G′ ∈ D, define

H(G, G′) := { j : Pa_G(j) ≠ Pa_{G′}(j) }

to be the set of nodes on which the parent sets differ between the graphs G and G′, and define the ratio

γ_Ω(G, G′) := ( score_Ω(G) − score_Ω(G′) ) / |H(G, G′)|,

a rescaled version of the gap between the score functions. Consider the following condition:

Assumption 2
There exists ξ′ > 0 such that

γ_Ω(G_0) := min_{G ∈ D_Θ, G ⊉ G_0} { max_{G̃ ⊇ G_0} { γ_Ω(G, G̃) } } ≥ ξ′.   (27)

Note that in addition to minimizing over DAGs in the class D_Θ, the expression (27) defined in Assumption 2 takes an inner maximization over DAGs containing G_0. As established in Lemma 6, we have score_Ω(G̃) = score_Ω(G_0) whenever G_0 ⊆ G̃. However, |H(G, G̃)| may be appreciably different from |H(G, G_0)|, and we are only interested in computing the gap ratio between a DAG G ⊉ G_0 and the closest DAG containing G_0.

We then have the following result, proved in Appendix D.4:

Lemma 21
Under Assumption 2, suppose

|f̂_{σ_j}(S) − f_{σ_j}(S)| ≤ ξ′/2,   for all j and S ⊆ N_Θ(j).   (28)

Then the containment (25) holds.
Combining with Lemma 18, we have the following corollary:
Corollary 22
Suppose the x_i's are i.i.d. sub-Gaussian with parameter σ^2, and |N_Θ(j)| ≤ d for all j. Also suppose Assumption 2 holds. Then with probability at least 1 − c_1 exp(−c_2 log p), condition (25) is satisfied.

We now turn to the question of which values of ξ′ might be expected to give condition (27) for various DAGs. Certainly, we have

γ_Ω(G, G′) ≥ ( score_Ω(G) − score_Ω(G′) ) / p,

so the condition holds when p · ξ′ < ξ(D_Θ). However, for ξ′ = O(ξ(D_Θ)/p), Corollary 22 yields a scaling condition similar to inequality (26), which we wish to avoid. As motivated by our computations of the score functions for small DAGs (see Remark 14 in Section 4.4), the difference {score_Ω(G) − score_Ω(G_0)} seems to increase linearly with the number of edge reversals needed to transform G_0 to G. Hence, we might expect γ_Ω(G, G_0) to remain roughly constant, rather than decreasing linearly with p. The following lemma verifies this intuition in a special case. For a review of junction tree terminology, see Appendix A.1.

Lemma 23
Suppose the moralized graph M(G_0) admits a junction tree representation with only singleton separator sets. Let C_1, …, C_k denote the maximal cliques in M(G_0), and let {G_{0,ℓ}}_{ℓ=1}^k denote the corresponding restrictions of G_0 to the cliques. Then

γ_Ω(G_0) ≥ min_{1 ≤ ℓ ≤ k} γ_Ω(G_{0,ℓ}),

where

γ_Ω(G_{0,ℓ}) := min_{G_ℓ ∈ D_Θ|_{C_ℓ}, G_ℓ ⊉ G_{0,ℓ}} { max_{G̃_ℓ ⊇ G_{0,ℓ}} { ( score_Ω(G_ℓ) − score_Ω(G̃_ℓ) ) / |H(G_ℓ, G̃_ℓ)| } }

is the gap ratio computed over DAGs restricted to the clique C_ℓ that are consistent with the moralized graph.

The proof is contained in Appendix D.5. We might expect the gap ratio γ_Ω(G_{0,ℓ}) to be a function of the size of the clique. In particular, if the treewidth of M(G_0) is bounded by w and we have γ_Ω(G_{0,ℓ}) ≥ ξ_w for all ℓ, Lemma 23 implies that

γ_Ω(G_0) ≥ ξ_w,

and we only need the parameter ξ′ appearing in Assumption 2 to be of the order of ξ_w, rather than scaling as the inverse of p. We expect a version of Lemma 23 to hold for graphs with bounded treewidth even when the separator sets have larger cardinality, but a full generalization of Lemma 23 and a more accurate characterization of γ_Ω(G_0) for arbitrary graphs is beyond the scope of this paper.

We now describe how our algorithm for DAG structure estimation in linear SEMs may be extended easily to accommodate systematically corrupted data. This refers to the setting where we observe noisy surrogates {z_i}_{i=1}^n in place of {x_i}_{i=1}^n. Two common examples include the following:

(a) Additive noise.
We have z_i = x_i + w_i, where w_i ⊥⊥ x_i is additive noise with known covariance Σ_w.

(b) Missing data.
This is one instance of the more general setting of multiplicative noise. For each 1 ≤ j ≤ p, and independently over coordinates, we have

z_{ij} = x_{ij} with probability 1 − α, and z_{ij} = ⋆ with probability α,

where the missing-data probability α is either estimated or known.

We again divide our discussion into two parts: estimating Θ_0 and computing score functions based on corrupted data.

Following the observation of Loh and Wainwright (2013), the graphical Lasso (17) may still be used to estimate the inverse covariance matrix Θ_0 in the high-dimensional setting, where we plug in a suitable estimator Γ̂ for the covariance matrix Σ = cov(x_i), based on the corrupted observations {z_i}_{i=1}^n. For instance, in the additive noise scenario, we may take

Γ̂ = Z^T Z / n − Σ_w,   (29)

and in the missing-data setting, we may take

Γ̂ = (Z^T Z / n) ⊙ M,   (30)

where ⊙ denotes the Hadamard product and M is the matrix with diagonal entries equal to 1/(1 − α) and off-diagonal entries equal to 1/(1 − α)^2.

Assuming conditions such as sub-Gaussianity, the output Θ̂ of the modified graphical Lasso (17) is statistically consistent under similar scaling as in the uncorrupted setting (Loh and Wainwright, 2013). For instance, in the additive noise setting, where the z_i's are sub-Gaussian with parameter σ_z^2, Lemma 16 holds with σ^2 replaced by σ_z^2. Analogous results hold in the low-dimensional setting, when the expressions for Γ̂ in equations (29) and (30) are invertible with high probability, and we may simply use Θ̂ = (Γ̂)^{−1}.

We now describe how to estimate score functions for DAGs based on corrupted data. By equation (18), this reduces to estimating

f_{σ_j}(S) = (1/σ_j^2) · E[(x_j − b_j^T x_S)^2],

for a subset S ⊆ {1, …, p} ∖ {j}, with |S| ≤ n. Note that

σ_j^2 · f_{σ_j}(S) = Σ_{jj} − 2 b_j^T Σ_{S,j} + b_j^T Σ_{SS} b_j = Σ_{jj} − Σ_{j,S} Σ_{SS}^{−1} Σ_{S,j},

since b_j = Σ_{SS}^{−1} Σ_{S,j}.

Let Γ̂ be the estimator for Σ based on corrupted data used in the graphical Lasso (e.g., equations (29) and (30)). We then use the estimator

f̃_{σ_j}(S) = (1/σ_j^2) · ( Γ̂_{jj} − Γ̂_{j,S} Γ̂_{SS}^{−1} Γ̂_{S,j} ).   (31)

Note in particular that equation (31) reduces to expression (19) in the fully observed setting. We establish consistency of the estimator in equation (31) under the following deviation condition on Γ̂:

P( |||Γ̂_{SS} − Σ_{SS}|||_2 ≥ c_0 σ^2 ( √(d/n) + t ) ) ≤ c_1 exp(−c_2 n t^2),   for any |S| ≤ d.   (32)

For instance, such a condition holds in the case of the sub-Gaussian additive noise model (cf. Lemma 29 in Appendix E), with Γ̂ given by equation (29), where σ^2 = σ_z^2.
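The corrected covariance surrogates (29) and (30) are simple to form; the Python sketch below assumes the additive-noise covariance Σ_w is known and that missing entries of Z have been encoded as zeros, which is an implementation convention of ours rather than part of the paper's formulation.

# Plug-in surrogates for the covariance matrix under corrupted observations (eqs. (29)-(30)).
import numpy as np

def gamma_hat_additive(Z, Sigma_w):
    # Equation (29): surrogate when z_i = x_i + w_i with known cov(w_i) = Sigma_w.
    n = Z.shape[0]
    return Z.T @ Z / n - Sigma_w

def gamma_hat_missing(Z, alpha):
    # Equation (30): surrogate when entries are missing independently with probability alpha,
    # with missing values encoded as 0 in Z.
    n, p = Z.shape
    M = np.full((p, p), 1.0 / (1.0 - alpha) ** 2)
    np.fill_diagonal(M, 1.0 / (1.0 - alpha))
    return (Z.T @ Z / n) * M          # Hadamard product with M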
We have the following result, an extension of Lemma 18 that applies also to corrupted variables:

Lemma 24

Suppose Γ̂ satisfies the deviation condition (32), and suppose |N_Θ(j)| ≤ d for all j. Then

|f̃_{σ_j}(S) − f_{σ_j}(S)| ≤ c_0 (σ^2/σ_j^2) √(log p / n),   for all j and S ⊆ N_Θ(j),

with probability at least 1 − c_1 exp(−c_2 log p).

The proof is contained in Appendix D.6. In particular, Corollary 22, providing guarantees for statistical consistency, also holds.
6. Computational considerations
In practice, the main computational bottleneck in inferring the DAG structure comes from having to compute score functions over a large number of DAGs. The simplest approach of searching over all possible permutation orderings of the indices gives rise to p! candidate DAGs, which scales exponentially with p. In this section, we describe how the result of Theorem 2 provides a general framework for achieving vast computational savings in finding the best-scoring DAG when data are generated from a linear SEM. We begin by reviewing existing methods, and then describe how our results may be used in conjunction with dynamic programming to produce accurate and efficient DAG learning.

Following the literature, we call a score function over DAGs decomposable if it may be written as a sum of score functions over individual nodes, each of which is a function of only the node and its parents:

score(G) = ∑_{j=1}^p score_j(Pa_G(j)).

Note that we allow the score functions to differ across nodes. Consistent with our earlier notation, the goal is to find the DAG G ∈ D that minimizes score(G). Some common examples of decomposable scores that are used for DAG inference include maximum likelihood, BDe, BIC, and AIC (Chickering, 1995). By equation (18), the squared ℓ_2-score is clearly decomposable, and it gives an example where score_j differs over nodes. Another interesting example is the nonparametric maximum likelihood, which extends the ordinary likelihood method for scoring DAGs (Nowzohour and Bühlmann, 2013).

Various recent results have focused on methods for optimizing a decomposable score function over the space of candidate DAGs in an efficient manner. Some methods include exhaustive search (Silander and Myllymaki, 2006), greedy methods (Chickering, 2002), and dynamic programming (Ordyniak and Szeider, 2012; Korhonen and Parviainen, 2013). We will focus here on a dynamic programming method that takes as input an undirected graph and outputs the best-scoring DAG with skeleton contained in the input graph.

In this section, we detail a method due to Ordyniak and Szeider (2012) that will be useful for our purposes. Given an input undirected graph G_I and a decomposable score function, the dynamic programming algorithm finds a DAG with minimal score that has skeleton contained in G_I. Let {N_I(j)}_{j=1}^p denote the neighborhood sets of G_I. The runtime of the dynamic programming algorithm is exponential in the treewidth w of G_I; hence, the algorithm is only tractable for bounded-treewidth graphs.

The main steps of the dynamic programming algorithm are as follows. For a review of basic terminology from graph theory, including treewidth and tree decompositions, see Appendix A; for further details and a proof of correctness, see Ordyniak and Szeider (2012).

1. Construct a tree decomposition of G_I with minimal treewidth.
2. Construct a nice tree decomposition of the graph. Let χ(t) denote the subset of {1, …, p} associated to a node t in the nice tree decomposition.

3. Starting from the leaves of the nice tree decomposition up to the root, compute the record for each node t. The record R(t) is the set of tuples (a, p, s) corresponding to minimal-scoring DAGs defined on the vertices χ*(t) in the subtree attached to t, with skeleton contained in G_I. For each such DAG, s is the score; a lists the parent sets of vertices in χ(t), such that a(v) ⊆ N_I(v) for each v ∈ χ(t), and a(v) restricted to χ*(t) agrees with the partial DAG; and p lists the directed paths between vertices in χ(t). The records R(t) may be computed recursively over the nice tree decomposition as follows:

• Join node:
Suppose t has children t_1 and t_2. Then R(t) is the set of tuples (a, p, s) formed by tuples (a_1, p_1, s_1) ∈ R(t_1) and (a_2, p_2, s_2) ∈ R(t_2), where (1) a = a_1 = a_2; (2) p is the transitive closure of p_1 ∪ p_2; (3) p contains no cycles; and (4) s = s_1 + s_2.

• Introduce node:
Suppose t is an introduce node with child t′, such that χ(t) = χ(t′) ∪ {v_0}. Then R(t) is the set of tuples (a, p, s) formed by pairs P ⊆ N_I(v_0) and (a′, p′, s′) ∈ R(t′), such that (1) a(v_0) = P; (2) for every v ∈ χ(t′), we have a(v) = a′(v); (3) p is the transitive closure of p′ ∪ {(u, v_0) : u ∈ P} ∪ {(v_0, u) : v_0 ∈ a′(u), u ∈ χ(t′)}; (4) p contains no cycles; and (5) s = s′.

• Forget node:
Suppose t is a forget node with child t′, such that χ(t′) = χ(t) ∪ {v_0}. Then R(t) is the set of tuples (a, p, s) formed from tuples (a′, p′, s′) ∈ R(t′), such that (1) a(u) = a′(u) for all u ∈ χ(t); (2) p = {(u, v) ∈ p′ : u, v ∈ χ(t)}; and (3) s = s′ + score_{v_0}(a′(v_0)).

Note that Korhonen and Parviainen (2013) present a variant of this dynamic programming method, also using a nice tree decomposition, which is applicable even for graphs with unbounded degree but bounded treewidth. They assume that the starting undirected graph G_I is a superset of the moralized DAG. Their algorithm runs in time linear in p and exponential in w. From a theoretical perspective, we are agnostic to both methods, since our results on statistical consistency of the graphical Lasso require the true moralized graph to have bounded degree. However, since supp(Θ_0) exactly corresponds to the edge set of M(G_0), the alternative method will also lead to correct recovery. In practice, the relative efficiency of the two dynamic programming algorithms will rely heavily on the structure of M(G_0).

We now review the runtime of the various components of the dynamic programming algorithm described in Section 6.2. This is mentioned briefly in Korhonen and Parviainen (2013), but we include details here for completeness. In our calculations, we assume the treewidth w of G_I is bounded and treat w as a constant.

The first step involves constructing a tree decomposition of minimal treewidth w, which may be done in time O(p). The second step involves constructing a nice tree decomposition. Given a tree decomposition of width w, a nice tree decomposition with O(p) nodes and treewidth w may be constructed in O(p) time (see Appendix A.2). Finally, the third step involves computing records for nodes in the nice tree decomposition. We consider the three different types of nodes in succession. Note that

|R(t)| ≤ 2^{(w+1)(w+d)},   for all t,   (33)

where d = max_j |N_I(j)|. This is because the number of choices of parent sets of any vertex in χ(t) is bounded by 2^d, leading to a factor of 2^{d(w+1)}, and the number of possible pairs that are connected by a path is bounded by 2^{(w+1)w}.

• If t is a join node with children t_1 and t_2, we may compute R(t) by comparing pairs of records in R(t_1) and R(t_2); by inequality (33), this may be done in time O(2^{2(w+1)(w+d)}).

• If t is an introduce node with child t′, we may compute R(t) by considering records in R(t′) and parent sets of the introduced node v_0. Since the number of choices for the latter is bounded by 2^d, we conclude that R(t) may be computed in time O(2^{(w+1)(w+d)+d}).

• Clearly, if t is a forget node, then R(t) may be computed in time O(2^{(w+1)(w+d)}).

Altogether, we conclude that all records of nodes in the nice tree decomposition may be computed in time O(p · 2^{2(w+1)(w+d)}). Combined with the graphical Lasso preprocessing step for estimating M(G_0), this leads to an overall complexity that remains polynomial in p. This may be compared to the runtime of other standard methods for causal inference, including the PC algorithm (Spirtes et al., 2000), which has computational complexity O(p^w), and (direct) LiNGAM (Shimizu et al., 2006, 2011), which requires time O(p^4). It has been noted that both the PC and LiNGAM algorithms may be expedited when prior knowledge about the DAG space is available, further highlighting the power of Theorem 2 as a preprocessing step for any causal inference algorithm.
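As a rough illustration of the preprocessing required in step 1, the following Python sketch uses networkx (assumed to be available) to obtain a tree decomposition of an estimated moralized graph. Note that the routine is a min-fill-in heuristic returning an upper bound on the treewidth, not the minimal-treewidth decomposition assumed in the runtime analysis above.

# Tree decomposition of the (estimated) moralized graph via a networkx heuristic.
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_fill_in

moral_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]     # e.g., edges read off supp(Theta_hat)
G_I = nx.Graph(moral_edges)

width, decomposition = treewidth_min_fill_in(G_I)
print("treewidth upper bound:", width)
for bag in decomposition.nodes:                     # bags are frozensets of vertices
    print(sorted(bag))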
7. Discussion
We have provided a new framework for estimating the DAG corresponding to a linear SEM. We have shown that the inverse covariance matrix of linear SEMs always reflects the edge structure of the moralized graph, even in non-Gaussian settings, and the reverse statement also holds under a mild faithfulness assumption. Furthermore, we have shown that when the error variances are known up to close precision, a simple weighted squared ℓ_2-loss may be used to select the correct DAG. As a corollary, we have established identifiability for the class of linear SEMs with error variances specified up to a constant multiple. We have proved that our methods are statistically consistent, under reasonable assumptions on the gap between the score of the true DAG and the next best DAG in the model class. A characterization of this gap parameter for various graphical structures is the topic of future work.

We have also shown how dynamic programming may be used to select the best-scoring DAG in an efficient manner, assuming the treewidth of the moralized graph is small. Our results relating the inverse covariance matrix to the moralized DAG provide a powerful method for reducing the DAG search space as a preprocessing step for dynamic programming, and are the first to provide rigorous guarantees for when the graphical Lasso may be used in non-Gaussian settings. Note that the dynamic programming algorithm only uses the information that the true DAG has skeleton lying in the input graph, and does not incorporate any information about (a) the fact that the data come from a linear SEM; or (b) the fact that the input graph exactly equals the moralized DAG. Intuitively, both types of information should place significant constraints on the restricted DAG space, leading to further speedups in the dynamic programming algorithm. Perhaps these restrictions would make it possible to establish a version of dynamic programming for DAGs where the moralized graph has bounded degree but large treewidth.

An important open question concerns scoring candidate DAGs when the diagonal matrix Ω of error variances is unknown. As we have seen, using the weighted squared ℓ_2-loss to score DAGs may produce a graph that is far from the true DAG when Ω is misspecified. Alternatively, it would be useful to have a checkable condition that would allow us to verify whether a given matrix Ω will correctly select the true DAG, or to be able to select the true Ω from among a finite collection of candidate matrices.

Acknowledgments
We acknowledge all the members of the Seminar für Statistik for providing an immensely hospitable and fruitful environment when PL was visiting ETH, and the Forschungsinstitut für Mathematik at ETH Zürich for financial support. We also thank Varun Jog for helpful discussions. PL was additionally supported by a Hertz Fellowship and an NSF Graduate Research Fellowship.
Appendix A. Graph-theoretic concepts
In this Appendix, we review some fundamental concepts in graph theory that we use in our exposition. We begin by discussing junction trees, and then move to the related notion of tree decompositions. Note that these are purely graph-theoretic operations that may be performed on an arbitrary undirected graph.
A.1 Junction trees
We begin with the basic junction tree framework. For more details, see Lauritzen (1996) or Koller and Friedman (2009).

For an undirected graph G = (V, E), a triangulation is an augmented graph G̃ = (V, Ẽ) that contains no chordless cycles of length greater than three. By classical graph theory, any triangulation G̃ gives rise to a junction tree representation of G, where nodes in the junction tree are subsets of V corresponding to maximal cliques of G̃, and the intersection of any two adjacent cliques C_1 and C_2 in the junction tree is referred to as a separator set S = C_1 ∩ C_2. Furthermore, any junction tree must satisfy the running intersection property, meaning that for any two nodes in the junction tree, say corresponding to cliques C_j and C_k, the intersection C_j ∩ C_k must belong to every separator set on the unique path between C_j and C_k in the junction tree. The treewidth of G is defined to be one less than the size of the largest clique in any triangulation G̃ of G, minimized over all triangulations.

As a classic example, note that if G is a tree, then G is already triangulated, and the junction tree parallels the tree structure of G. The maximal cliques in the junction tree are equal to the edges of G, and the separator sets correspond to singleton vertices. The treewidth of G is consequently equal to 1.

A.2 Tree decompositions
We now highlight some basic concepts of tree decompositions and nice tree decompositions used in our dynamic programming framework. Our exposition follows Kloks (1994).

Let G = (V, E) be an undirected graph. A tree decomposition of G is a tree T with node set W such that each node t ∈ W is associated with a subset V_t ⊆ V, and the following properties are satisfied:
(a) ⋃_{t ∈ T} V_t = V;
(b) for all (u, v) ∈ E, there exists a node t ∈ W such that u, v ∈ V_t;
(c) for each v ∈ V, the set of nodes {t : v ∈ V_t} forms a subtree of T.
The width of the tree decomposition is max_{t ∈ T} |V_t| − 1. The treewidth of G is the minimal width of any tree decomposition of G; this quantity agrees with the treewidth defined in terms of junction trees in the previous section. If G has bounded treewidth, a tree decomposition with minimum width may be constructed in time O(|V|) (cf. Chapter 15 of Kloks (1994)).

A nice tree decomposition is a rooted tree decomposition satisfying the following properties:
(a) every node has at most two children;
(b) if a node t has two children r and s, then V_t = V_r = V_s;
(c) if a node t has one child s, then either
(i) |V_t| = |V_s| + 1 and V_s ⊆ V_t, or
(ii) |V_s| = |V_t| + 1 and V_t ⊆ V_s.
Nodes of the form (b), (c)(i), and (c)(ii) are called join nodes, introduce nodes, and forget nodes, respectively. Given a tree decomposition of G with width w, a nice tree decomposition with width w and at most 4|V| nodes may be computed in time O(|V|) (cf. Lemma 13.1.3 of Kloks (1994)).
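As a concrete illustration of these definitions, the following sketch (assuming the networkx library is available) builds a tree decomposition of a small graph with the min-degree heuristic and checks properties (a)–(c); the heuristic returns a valid decomposition whose width upper-bounds the treewidth, so the reported width need not be minimal.

```python
# A small sketch that builds a tree decomposition with the min-degree heuristic
# and checks properties (a)-(c) from Appendix A.2.
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

G = nx.cycle_graph(6)                  # a 6-cycle has treewidth 2
width, T = treewidth_min_degree(G)     # T's nodes are bags (frozensets of vertices)
print("width of this decomposition:", width)

# (a) the bags cover every vertex
assert set().union(*T.nodes) == set(G.nodes)
# (b) every edge of G is contained in some bag
assert all(any({u, v} <= bag for bag in T.nodes) for u, v in G.edges)
# (c) for each vertex, the bags containing it induce a connected subtree of T
for v in G.nodes:
    bags_with_v = [bag for bag in T.nodes if v in bag]
    assert nx.is_connected(T.subgraph(bags_with_v))
```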
Appendix B. Matrix derivations

In this section, we present a few matrix results that are used to prove Theorem 7. Define a unit lower triangular (LT) matrix to be a lower triangular matrix with 1's on the diagonal. Recall that matrices A and B are permutation similar if there exists a permutation matrix P such that A = PBP^T. Call a matrix permutation unit LT if it is permutation similar to a unit lower triangular matrix. We have the following lemma:

Lemma 25
Suppose A and B are permutation unit LT matrices, and suppose AA^T = BB^T. Then A = B.

Proof
Under the appropriate relabeling, we assume without loss of generality that A is unit LT. There exists a permutation matrix P such that C := PBP^T is also unit LT. We have

P AA^T P^T = CC^T. (34)

Let π be the permutation on {1, . . . , n} such that P_{i,π(i)} = 1 for all i, and P has 0's everywhere else. Define the notation ã_{ij} := (PA)_{ij} and m_{ij} := (CC^T)_{ij}, and let {c_{ij}} denote the entries of C. We will make use of the following equalities, which follow from equation (34) and the fact that C is unit LT:

∑_k ã_{ik} ã_{jk} = m_{ij} = ∑_k c_{ik} c_{jk}, ∀ i, j. (35)
The proof proceeds by an outer induction on the row index i; the base case i = 1 follows from equation (35) with i = j = 1, together with the fact that ã_{1,π(1)} = 1. For the induction step, suppose that ã_{ℓ,π(k)} = c_{ℓk} for all rows ℓ ≤ i − 1. We first show that ã_{i,π(j)} = c_{ij} for all j < i by a sub-induction on j. For j = 1, we have by equation (35) and the base result for i = 1 that ã_{i,π(1)} = m_{i,1} = c_{i,1}, which is exactly what we want. For the sub-induction step, consider 1 < j < i, and suppose ã_{i,π(ℓ)} = c_{iℓ} for all ℓ < j. Note that ã_{j,π(ℓ)} = 0 for all ℓ > j by the outer induction hypothesis. Hence, equation (35) and the fact that ã_{j,π(j)} = 1 give

∑_{ℓ ≤ j} ã_{i,π(ℓ)} c_{jℓ} = m_{ij} = ∑_{ℓ ≤ j} c_{iℓ} c_{jℓ},

so that ã_{i,π(j)} = c_{ij}, completing the sub-induction. Finally, applying equation (35) with j = i and using the fact that ã_{i,π(i)} = 1 shows that ã_{i,π(ℓ)} = 0 = c_{iℓ} for ℓ > i, completing the outer induction. Hence PA = CP, so that A = P^T C P = B.
Lemma 26

Suppose A ∈ ℝ^{n×n} is positive definite with det(A) = 1. Then

min{tr(AB) : B ≻ 0 and det(B) = 1} = n.

Proof
Consider the singular value decomposition A = UΛU^*, and note that tr(AB) = tr(Λ(U^* BU)). Denote b_{ij} := (U^* BU)_{ij} and λ_i := Λ_{ii}. Then by the AM-GM inequality and Hadamard's inequality, we have

(1/n) · tr(Λ(U^* BU)) ≥ (∏_i λ_i b_{ii})^{1/n} = (det(A))^{1/n} · (∏_i b_{ii})^{1/n} ≥ (det(U^* BU))^{1/n} = 1,

implying the result.

Building upon Lemmas 25 and 26, we obtain the following result:

Lemma 27
Suppose A and B are n × n permutation unit LT matrices. Then

min_B tr(AA^T B^T B) ≥ n, (39)

with equality achieved if and only if B = A^{-1}.

Proof
Write A′ = AA^T and B′ = B^T B, and note that since det(A) = det(B) = 1, we also have det(A′) = det(B′) = 1. Then inequality (39) holds by Lemma 26.

To recover the conditions for equality, note that equality holds in Hadamard's inequality if and only if some b_{ii} = 0 or the matrix U^* BU is diagonal. Note that the first case is not possible, since U^* BU ≻ 0. In the second case, we see that in addition, we need b_{ii} = λ_i^{-1} for all i in order to achieve equality in the AM-GM inequality. It follows that U^* BU = Λ^{-1}, so AA^T = B^{-1} B^{-T}. Since A and B are permutation unit LT, Lemma 25 implies that the last equality can only hold when B = A^{-1}.
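The following numerical sketch (ours, purely illustrative) checks Lemma 27 on a randomly generated permutation unit LT matrix: the objective equals n at B = A^{-1} and is no smaller for another admissible choice of B.

```python
# Numerical sanity check (a sketch, not part of the proof) for Lemma 27.
import numpy as np

rng = np.random.default_rng(0)
n = 5

L = np.tril(rng.standard_normal((n, n)), k=-1) + np.eye(n)  # unit lower triangular
P = np.eye(n)[rng.permutation(n)]                            # random permutation matrix
A = P @ L @ P.T                                              # permutation unit LT

def objective(B):
    return np.trace(A @ A.T @ B.T @ B)

B_star = np.linalg.inv(A)
print(objective(B_star))                    # equals n (up to rounding)

# any other permutation unit LT matrix gives a value no smaller than n
L2 = np.tril(rng.standard_normal((n, n)), k=-1) + np.eye(n)
B_other = P @ np.linalg.inv(L2) @ P.T       # inverse of a unit LT matrix is unit LT
print(objective(B_other) >= n - 1e-9)       # True
```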
Appendix C. Proofs for population-level results

In this section, we provide proofs for the remaining results in Sections 3 and 4.

C.1 Proof of Lemma 1
We first show that Ω is a diagonal matrix. Consider j < k; we will show that ǫ_j ⊥⊥ ǫ_k, from which we conclude that

E[ǫ_j ǫ_k] = E[ǫ_j] · E[ǫ_k] = 0.

Indeed, we have ǫ_k ⊥⊥ (X_1, . . . , X_{k−1}) by assumption. Since ǫ_j = X_j − b_j^T X is a deterministic function of (X_1, . . . , X_j), it follows that ǫ_k ⊥⊥ ǫ_j, as claimed.

Turning to equations (4) and (5), note that since B ∈ U, the matrix (I − B) is always invertible, and by equation (3), we have

Σ = (I − B)^{-T} Ω (I − B)^{-1}, (40)

and

Θ = Σ^{-1} = (I − B) Ω^{-1} (I − B)^T. (41)

Then expanding and using the fact that B is upper triangular and Ω is diagonal, we obtain equations (4) and (5).
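For readers who want to verify equations (40)–(41) numerically, here is a short sketch on a hypothetical five-node SEM of our own choosing; it also checks that the nonzero pattern of Θ is contained in the edge set of the moralized graph.

```python
# A sketch (hypothetical 5-node example) checking equation (41) and the fact that
# the support of Theta lies within the edge set of the moralized graph.
import numpy as np
from itertools import combinations

p = 5
B = np.zeros((p, p))                # autoregression matrix, upper triangular (topological order)
B[0, 2] = 0.8; B[1, 2] = -0.5; B[2, 3] = 1.2; B[1, 4] = 0.7   # edges 0->2, 1->2, 2->3, 1->4
Omega = np.diag([1.0, 0.5, 2.0, 1.5, 0.8])                    # error variances

I = np.eye(p)
Sigma = np.linalg.inv(I - B).T @ Omega @ np.linalg.inv(I - B)  # equation (40)
Theta = (I - B) @ np.linalg.inv(Omega) @ (I - B).T             # equation (41)
assert np.allclose(Theta, np.linalg.inv(Sigma))

# moralized graph: connect each parent-child pair and each pair of parents sharing a child
moral = set()
for j, k in zip(*np.nonzero(B)):
    moral.add(frozenset((j, k)))
for child in range(p):
    for j, k in combinations(np.nonzero(B[:, child])[0], 2):
        moral.add(frozenset((j, k)))

for j, k in combinations(range(p), 2):
    if abs(Theta[j, k]) > 1e-10:
        assert frozenset((j, k)) in moral   # supp(Theta) lies within the moralized graph
```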
C.2 Proof of Lemma 6

Since G_0 ⊆ G, we have Pa_{G_0}(j) ⊆ Pa_G(j) for each j. Furthermore, no element of Pa_G(j) may be a descendant of j in G_0, since this would contradict the fact that G contains no cycles. By the Markov property of G_0, we therefore have

X_j ⊥⊥ X_{Pa_G(j) \ Pa_{G_0}(j)} | X_{Pa_{G_0}(j)}.

Thus, the linear regression coefficients for X_j regressed upon X_{Pa_G(j)} are simply the linear regression coefficients for X_j regressed upon X_{Pa_{G_0}(j)} (and the remaining coefficients, for X_{Pa_G(j) \ Pa_{G_0}(j)}, are zero). By Remark 5, we conclude that B_0 = B_G = arg min_{B ∈ U_G} {score_Ω(B)}, and the uniqueness of B_0 follows from the uniqueness of B_G.
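The population-level claim in this proof can be illustrated by simulation; the sketch below (with an assumed four-node SEM and parameter values of our own choosing) regresses a variable on a strict superset of its true parent set and observes that the extra coefficients are approximately zero.

```python
# Illustrative simulation: regressing X_j on a superset of its true parents
# recovers the true coefficients, with (approximately) zero extra coefficients.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200000, 4
B = np.zeros((p, p))
B[0, 1] = 1.5; B[1, 2] = -0.8        # true DAG: 0 -> 1 -> 2, node 3 isolated
eps = rng.standard_normal((n, p)) * np.array([1.0, 0.7, 1.3, 0.9])

# generate data in topological order: X_j = sum_k B[k, j] X_k + eps_j
X = np.zeros((n, p))
for j in range(p):
    X[:, j] = X @ B[:, j] + eps[:, j]

# regress X_2 on the superset {0, 1, 3} of its true parent set {1}
S = [0, 1, 3]
coef = np.linalg.lstsq(X[:, S], X[:, 2], rcond=None)[0]
print(dict(zip(S, np.round(coef, 3))))   # coefficient on 1 is ~ -0.8; on 0 and 3 ~ 0
```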
C.3 Proof of Theorem 9

From the decomposition (6), it is easy to see that for any B ∈ U, we have

a_min ≤ score_Ω(B) / score_{Ω_0}(B) ≤ a_max, (42)

simply by comparing individual terms; e.g.,

(1/(Ω)_{jj}) · E[(X_j − b_j^T X)^2] ≤ max_j {(Ω_0)_{jj} / (Ω)_{jj}} · (1/(Ω_0)_{jj}) · E[(X_j − b_j^T X)^2] = a_max · ((1/(Ω_0)_{jj}) · E[(X_j − b_j^T X)^2]).

Note that if G ⊇ G_0, then by Lemma 6, the matrix B_0 is the unique minimizer of score_Ω(B) among the class U_G. Now consider G ⊉ G_0 and B ∈ U_G. We have

(1 + ξ/p) · score_{Ω_0}(B_0) = min_{G′ ∈ D, G′ ⊉ G_0} {score_{Ω_0}(G′)} ≤ score_{Ω_0}(G) ≤ score_{Ω_0}(B), (43)

where the first relation uses the definition of the gap (10) and the fact that score_{Ω_0}(B_0) = p by Theorem 7. Hence,

score_Ω(B_0) ≤ a_max · score_{Ω_0}(B_0) ≤ (a_max / (1 + ξ/p)) · score_{Ω_0}(B) ≤ (a_max / (a_min (1 + ξ/p))) · score_Ω(B),

where the first and third inequalities use inequality (42), and the second inequality uses inequality (43). By the assumption (11), it follows that

score_Ω(B_0) ≤ score_Ω(B),

as wanted. The statement regarding strict inequality is clear.

C.4 Proof of Lemma 11
We first consider the case when r ≥ 1. Then a_max/a_min = r, so combining Theorem 9 with the expression (14), we obtain a sufficient condition relating r and b. Rearranging gives a quadratic inequality in b², and solving this quadratic for b² and simplifying yields the desired expression.

If instead r ≤ 1, we have a_max/a_min = 1/r, so the sufficient condition involves 1/r instead; rearranging again gives a quadratic inequality in b², now with coefficients involving 1 − r rather than r − 1, and solving for b² and simplifying further yields the stated expression.

Figure 3: Alternative DAGs.
C.5 Proof of Lemma 13
To compute the gap ξ, it is sufficient to consider the graphs in Figure 3. Indeed, we have score_{Ω_0}(G) ≤ score_{Ω_0}(G′) whenever G′ ⊆ G, so we only need to consider maximal elements in the poset of DAGs not containing G_0. Consider the DAGs in panels (a) and (c) of Figure 3, with autoregression matrices C and E whose free entries are denoted c_1, c_2, c_3 and e_1, e_2, e_3, respectively. A simple calculation expresses score_{Ω_0}(C) as a quadratic function of (c_1, c_2, c_3); minimizing over these entries leads to

ξ_1 = min_{c_1, c_2, c_3} {score_{Ω_0}(C) − score_{Ω_0}(B_0)}. (44)

Similarly, we may compute score_{Ω_0}(E) and minimize over (e_1, e_2, e_3), leading to

ξ_2 = min_{e_1, e_2, e_3} {score_{Ω_0}(E) − score_{Ω_0}(B_0)}. (45)

Finally, note that the graphs in panels (b) and (d) of Figure 3 are mirror images of the graphs in panels (a) and (c), respectively. Hence, we obtain

ξ_3 = min_{d_1, d_2, d_3} {score_{Ω_0}(D) − score_{Ω_0}(B_0)} and ξ_4 = min_{f_1, f_2, f_3} {score_{Ω_0}(F) − score_{Ω_0}(B_0)},

simply by swapping the roles of nodes 1 and 2 in the expressions above. Taking ξ = min{ξ_1, ξ_2, ξ_3, ξ_4} then yields the desired result.

Appendix D. Proofs for statistical consistency
In this Appendix, we provide the proofs for the lemmas on statistical consistency stated inSection 5.
D.1 Proof of Lemma 15
This result follows from the fact that

‖Θ̂ − Θ‖_max ≤ |||Θ̂ − Θ|||,

together with results on the spectral norm of sub-Gaussian covariances and their inverses (see Lemma 29 in Appendix E).
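The elementwise/spectral-norm comparison used here is easy to check numerically; the following sketch (illustrative only) does so for a random symmetric matrix.

```python
# Quick numerical check: the maximum absolute entry of a matrix is bounded by
# its spectral norm, since |M_ij| = |e_i^T M e_j| <= |||M|||.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
M = A + A.T                            # symmetric "difference" matrix
max_norm = np.max(np.abs(M))
spec_norm = np.linalg.norm(M, 2)       # largest singular value
assert max_norm <= spec_norm + 1e-12
print(max_norm, spec_norm)
```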
D.2 Proof of Lemma 18

First consider a fixed pair (j, S) such that S ⊆ N_Θ(j). We may write

x_j = b_j^T x_S + e_j, (46)

where e_j has zero mean and is uncorrelated with x_S (and also depends on the choice of S). In matrix notation, we have

b̂_j = (X_S^T X_S)^{-1}(X_S^T X_j) = (X_S^T X_S)^{-1} X_S^T (X_S b_j + E_j) = b_j + (X_S^T X_S)^{-1} X_S^T E_j,

where the second equality follows from equation (46). Hence,

σ_j² · f̂_{σ_j}(S) = (1/n) ‖X_j − X_S b̂_j‖_2² = (1/n) ‖X_S(b_j − b̂_j) + E_j‖_2² = (1/n) ‖(I − X_S(X_S^T X_S)^{-1} X_S^T) E_j‖_2². (47)

By the triangle inequality, we have

| ‖(I − X_S(X_S^T X_S)^{-1} X_S^T) E_j‖_2 − ‖E_j‖_2 | ≤ ‖X_S(X_S^T X_S)^{-1} X_S^T E_j‖_2 ≤ |||(X_S^T X_S)^{-1} X_S^T||| · ‖E_j‖_2. (48)

Furthermore,

|||(X_S^T X_S)^{-1} X_S^T||| = |||X_S(X_S^T X_S)^{-1}||| = sup_{‖v‖_2 ≤ 1} {v^T (X_S^T X_S)^{-1} X_S^T X_S (X_S^T X_S)^{-1} v}^{1/2} = sup_{‖v‖_2 ≤ 1} {v^T (X_S^T X_S)^{-1} v}^{1/2} = (1/√n) · |||(X_S^T X_S / n)^{-1}|||^{1/2} ≤ (C/√n) · (|||Σ_SS^{-1}||| + 2σ² |||Σ_SS^{-1}|||² · max{δ, δ²})^{1/2},

with probability at least 1 − 2 exp(−c n t²), where δ = c′ √(|S|/n) + c″ t, by Lemma 29. Taking a union bound over all 2^d choices for S and p choices for j, and setting t = c √((d + log p)/n), we have

|||(X_S^T X_S)^{-1} X_S^T||| ≤ C′/√n, for all S such that S ⊆ N_Θ(j) for some j, (49)

with probability at least

1 − c₁ exp(−c₂ n t² + d log 2 + log p) ≥ 1 − c₁ exp(−c′ (d + log p)).

Combining inequalities (48) and (49), we have the uniform bound

(1 − C′/√n) ‖E_j‖_2 ≤ ‖(I − X_S(X_S^T X_S)^{-1} X_S^T) E_j‖_2 ≤ (1 + C′/√n) ‖E_j‖_2, w.h.p.,

which together with equation (47) implies that

| σ_j² · f̂_{σ_j}(S) − (1/n) ‖E_j‖_2² | ≤ (3C′/√n) · (1/n) ‖E_j‖_2², (50)

using the fact that

max{1 − (1 − a)², (1 + a)² − 1} = max{2a − a², 2a + a²} = 2a + a² ≤ 3a,

for a = C′/√n sufficiently small. Furthermore,

(1/n) E[‖E_j‖_2²] = σ_j² · f_{σ_j}(S).

Note that the e_j's are i.i.d. sub-Gaussians with parameter at most cσ², since we may write e_j = b̃_j^T x for the appropriate b̃_j ∈ ℝ^p, and ‖b̃_j‖_2 is bounded in terms of the eigenvalues of Σ. Applying the usual sub-Gaussian tail bounds, we then have

P( |(1/n) ‖E_j‖_2² − (1/n) E[‖E_j‖_2²]| ≥ cσ² t ) ≤ c₁ exp(−c₂ n t²), ∀ j,

and taking a union bound over j and setting t = c′ √(log p / n) gives

max_j |(1/n) ‖E_j‖_2² − (1/n) E[‖E_j‖_2²]| ≤ c₀ σ² √(log p / n), (51)

with probability at least 1 − c₁ exp(−c₂ log p).
Combining inequalities (50) and (51), it follows that

σ_j² |f̂_{σ_j}(S) − f_{σ_j}(S)| ≤ |σ_j² · f̂_{σ_j}(S) − (1/n) ‖E_j‖_2²| + |(1/n) ‖E_j‖_2² − (1/n) E[‖E_j‖_2²]| ≤ (3C′/√n) · ((1/n) E[‖E_j‖_2²] + c₀ σ² √(log p / n)) + c₀ σ² √(log p / n) ≤ c′ σ² √(log p / n),

with probability at least 1 − c₁ exp(−c₂ log p).
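The algebraic identity underlying equation (47) — that the fitted residual equals the projection of E_j onto the orthogonal complement of the column space of X_S — can be checked directly; the sketch below uses synthetic data and parameter values of our own choosing.

```python
# Sketch verifying the identity behind equation (47) on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n, s = 200, 3
X_S = rng.standard_normal((n, s))
b_j = np.array([1.0, -2.0, 0.5])
E_j = rng.standard_normal(n)
X_j = X_S @ b_j + E_j

b_hat = np.linalg.solve(X_S.T @ X_S, X_S.T @ X_j)     # OLS coefficients
resid = X_j - X_S @ b_hat

P_S = X_S @ np.linalg.solve(X_S.T @ X_S, X_S.T)        # projection onto col(X_S)
assert np.allclose(resid, (np.eye(n) - P_S) @ E_j)
print(np.mean(resid**2))                               # (1/n)||(I - P_S) E_j||^2
```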
D.3 Proof of Lemma 19

Combining inequalities (20) and (22) and using the triangle inequality, we have

|ŝcore_Ω(G) − score_Ω(G)| ≤ ∑_{j=1}^p |ŝcore_{σ_j}(Pa_G(j)) − score_{σ_j}(Pa_G(j))| < ξ(D_Θ)/2, (52)

for all G ∈ D_Θ. In particular, for G ∈ D_Θ such that G ⊉ G_0, we have

ŝcore_Ω(G_0) < score_Ω(G_0) + ξ(D_Θ)/2 ≤ (score_Ω(G) − ξ(D_Θ)) + ξ(D_Θ)/2 < ŝcore_Ω(G),

where the first and third inequalities use inequality (52), and the second inequality uses the definition of the gap ξ(D_Θ). This implies inequality (23).

D.4 Proof of Lemma 21
Consider G ∈ D_Θ with G ⊉ G_0, and consider G̃ ⊇ G_0 such that γ_Ω(G, G̃) is maximized. Note that if Pa_G(j) = Pa_{G̃}(j), then certainly

f̂_{σ_j}(Pa_G(j)) − f̂_{σ_j}(Pa_{G̃}(j)) = 0 = f_{σ_j}(Pa_G(j)) − f_{σ_j}(Pa_{G̃}(j)).

Hence,

|(ŝcore_Ω(G) − ŝcore_Ω(G̃)) − (score_Ω(G) − score_Ω(G̃))| ≤ |H(G, G̃)| · ξ′, (53)

using inequality (28) and the triangle inequality. Furthermore, by inequality (27),

|H(G, G̃)| · ξ′ ≤ [(score_Ω(G) − score_Ω(G̃)) / (2ξ′)] · ξ′ = (score_Ω(G) − score_Ω(G̃))/2. (54)

Combining inequalities (53) and (54) gives

ŝcore_Ω(G) − ŝcore_Ω(G̃) ≥ (score_Ω(G) − score_Ω(G̃))/2 = (score_Ω(G) − score_Ω(G_0))/2 > 0,

where the last inequality holds because of the assumption ξ′ > 0. Hence, G ∉ arg min_{G ∈ D_Θ} {ŝcore_Ω(G)}, implying the desired result.

D.5 Proof of Lemma 23
We begin with a simple lemma:
Lemma 28
Suppose M(G) admits a junction tree representation with only singleton separators, and let C_1, . . . , C_k denote the maximal cliques. If X follows a linear SEM over G, then the marginal distribution of X over the nodes in any clique C_ℓ also follows a linear SEM over C_ℓ, with DAG structure specified by G_ℓ, the restriction of G to C_ℓ. In addition, the autoregression matrix for the marginal SEM is simply the autoregression matrix for the full SEM restricted to the nodes in C_ℓ.

Proof
We relabel the nodes of G so that the natural ordering on {1, . . . , p} is a topological order. Clearly, this induces a topological order over the nodes of G_ℓ, as well. Recall that we have equation (3); i.e., for each j,

X_j = b_j^T X_{1:(j−1)} + ǫ_j, where ǫ_j ⊥⊥ (X_1, . . . , X_{j−1}). (55)

For each j ∈ C_ℓ, we define

ǫ′_j := ǫ_j + ∑_{k < j, k ∉ C_ℓ} b_{kj} X_k.
0, implies (∑_ℓ a_ℓ)/(∑_ℓ b_ℓ) > ξ, we conclude from equation (63) and inequalities (64) and (65) that

max_{G̃ ⊇ G_0} {γ_Ω(G, G̃)} ≥ [∑_{ℓ=1}^{k″} (score_Ω(G_ℓ) − score_Ω(G̃_ℓ))] / [∑_{ℓ=1}^{k″} |H(G_ℓ, G̃_ℓ)|] ≥ min_{1 ≤ ℓ ≤ k} γ_Ω(G_ℓ).

Since this result holds uniformly over all G, we have γ_Ω(G) ≥ min_{1 ≤ ℓ ≤ k} γ_Ω(G_ℓ), as well.

D.6 Proof of Lemma 24
This proof is quite similar to the proof for the fully-observed case, so we only mention the high-level details here.

We write

σ_j² |f̃_{σ_j}(S) − f_{σ_j}(S)| = |(Γ̂_{jj} − Γ̂_{j,S} Γ̂_{SS}^{-1} Γ̂_{S,j}) − (Σ_{jj} − Σ_{j,S} Σ_{SS}^{-1} Σ_{S,j})| ≤ |Γ̂_{jj} − Σ_{jj}| + |Γ̂_{j,S} Γ̂_{SS}^{-1} Γ̂_{S,j} − Σ_{j,S} Σ_{SS}^{-1} Σ_{S,j}| =: |Γ̂_{jj} − Σ_{jj}| + A. (66)

The first term may be bounded directly using inequality (32) and a union bound over j:

P( max_j |Γ̂_{jj} − Σ_{jj}| ≥ cσ² √(log p / n) ) ≤ c′ exp(−c″ log p). (67)

To bound the second term, we use the following expansion:

A ≤ |Γ̂_{j,S}(Γ̂_{SS}^{-1} − Σ_{SS}^{-1})Γ̂_{S,j}| + |Γ̂_{j,S} Σ_{SS}^{-1} (Γ̂_{S,j} − Σ_{S,j})| + |(Γ̂_{j,S} − Σ_{j,S}) Σ_{SS}^{-1} Σ_{S,j}|
≤ |||Γ̂_{SS}^{-1} − Σ_{SS}^{-1}||| · ‖Γ̂_{S,j}‖_2² + |||Σ_{SS}^{-1}||| · (‖Γ̂_{S,j}‖_2 ‖Γ̂_{S,j} − Σ_{S,j}‖_2 + ‖Γ̂_{S,j} − Σ_{S,j}‖_2 ‖Σ_{S,j}‖_2).

As in the proof of Lemma 29 in Appendix E, we may obtain a bound of the form

P( |||Γ̂_{SS}^{-1} − Σ_{SS}^{-1}||| ≥ cσ² (√(d/n) + t) ) ≤ c exp(−c n t²),

by inverting the deviation condition (32). Furthermore,

‖Γ̂_{S,j} − Σ_{S,j}‖_2 ≤ |||Γ̂_{S′S′} − Σ_{S′S′}|||, where S′ := S ∪ {j},

which may in turn be bounded using the deviation condition (32). We also have

‖Γ̂_{S,j}‖_2 ≤ ‖Σ_{S,j}‖_2 + |||Γ̂_{S′S′} − Σ_{S′S′}|||.

Combining these results and taking a union bound over the 2^d choices for S and p choices for j, we arrive at a uniform bound of the form

P( A ≤ c′ σ² √(log p / n) ) ≥ 1 − c′ exp(−c″ log p).

Together with inequality (67) and the expansion (66), we then obtain the desired result.
Appendix E. Matrix concentration results
This Appendix contains matrix concentration results that are used to prove our technical lemmas. We use |||·||| to denote the spectral norm of a matrix.

Lemma 29
Suppose {x_i}_{i=1}^n ⊆ ℝ^p are i.i.d. sub-Gaussian vectors with parameter σ² and covariance Σ. Then for all t ≥ 0, we have

P( |||X^T X / n − Σ||| ≤ σ² · max{δ, δ²} ) ≥ 1 − 2 exp(−c n t²), (68)

where δ = c′ √(p/n) + c″ t. Furthermore, if X^T X / n is invertible and σ² |||Σ^{-1}||| · max{δ, δ²} ≤ 1/2, we have

P( |||(X^T X / n)^{-1} − Σ^{-1}||| ≤ 2σ² |||Σ^{-1}|||² · max{δ, δ²} ) ≥ 1 − 2 exp(−c n t²). (69)

Proof
For inequality (68), see Remark 5.40 of Vershynin (2012). For inequality (69), we use the matrix expansion

(A + Δ)^{-1} = (A(I + A^{-1}Δ))^{-1} = (I + A^{-1}Δ)^{-1} A^{-1} = A^{-1} + ∑_{k=1}^∞ (−1)^k (A^{-1}Δ)^k A^{-1},

valid for any matrices A and Δ such that A and A + Δ are both invertible and the series converges. By the triangle inequality and submultiplicativity of the spectral norm, we then have

|||(A + Δ)^{-1} − A^{-1}||| ≤ ∑_{k=1}^∞ |||(A^{-1}Δ)^k A^{-1}||| ≤ |||A^{-1}||| · ∑_{k=1}^∞ |||A^{-1}Δ|||^k = |||A^{-1}||| · |||A^{-1}Δ||| / (1 − |||A^{-1}Δ|||) ≤ |||A^{-1}|||² · |||Δ||| / (1 − |||A^{-1}Δ|||).

We now take A = Σ and Δ = X^T X / n − Σ. By the assumption and inequality (68), we have

|||A^{-1}Δ||| ≤ |||A^{-1}||| · |||Δ||| ≤ 1/2,

implying that

|||(A + Δ)^{-1} − A^{-1}||| ≤ 2 |||A^{-1}|||² · |||Δ|||.

This gives the result.
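The perturbation bound obtained in this proof can be illustrated numerically; the sketch below (with an assumed well-conditioned covariance and a small symmetric perturbation of our own choosing) verifies that the bound holds once the condition |||A^{-1}Δ||| ≤ 1/2 is met.

```python
# Numerical illustration of the perturbation bound from the proof of Lemma 29:
# if |||A^{-1} D||| <= 1/2, then |||(A + D)^{-1} - A^{-1}||| <= 2 |||A^{-1}|||^2 |||D|||.
import numpy as np

rng = np.random.default_rng(3)
p = 6
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)                 # well-conditioned positive definite "Sigma"

D = rng.standard_normal((p, p))
D = 0.05 * (D + D.T)                        # small symmetric perturbation
A_inv = np.linalg.inv(A)

assert np.linalg.norm(A_inv @ D, 2) <= 0.5  # condition of the lemma
lhs = np.linalg.norm(np.linalg.inv(A + D) - A_inv, 2)
rhs = 2 * np.linalg.norm(A_inv, 2) ** 2 * np.linalg.norm(D, 2)
print(lhs <= rhs)                           # True under the stated condition
```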
References
O.O. Aalen, K. Røysland, J.M. Gran, and B. Ledergerber. Causality, mediation and time: A dynamic viewpoint. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175(4):831–861, 2012.

P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search, and penalized regression. arXiv e-prints, October 2013. Available at http://arxiv.org/abs/1310.1533.

D.M. Chickering. A transformational characterization of equivalent Bayesian network structures. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 87–98, 1995.

D.M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990.

T.R. Hughes, M.J. Marton, A.R. Jones, C.J. Roberts, R. Stoughton, C.D. Armour, H.A. Bennett, E. Coffey, H. Dai, Y.D. He, et al. Functional discovery via a compendium of expression profiles. Cell, 102:109–126, 2000.

T. Kloks. Treewidth: Computations and Approximations, volume 842. Springer, 1994.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

J.H. Korhonen and P. Parviainen. Exact learning of bounded tree-width Bayesian networks. In Artificial Intelligence and Statistics (AISTATS 2013), pages 370–378. JMLR, 2013.

S.L. Lauritzen. Graphical Models. Oxford University Press, 1996.

P. Loh and M.J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40(3):1637–1664, 2012.

P. Loh and M.J. Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. Annals of Statistics, 2013. To appear.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.

C. Nowzohour and P. Bühlmann. Score-based causal learning in additive noise models. 2013. In preparation.

S. Ordyniak and S. Szeider. Algorithms and complexity results for exact Bayesian structure learning. CoRR, abs/1203.3501, 2012.

E. Perrier, S. Imoto, and S. Miyano. Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, 9(2):2251–2286, 2008.

J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. arXiv e-prints, August 2013. Available at http://arxiv.org/abs/1205.2536. To appear in Biometrika.

P. Ravikumar, M.J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ_1-penalized log-determinant divergence. Electronic Journal of Statistics, 4:935–980, 2011.

S. Shimizu, P.O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.

S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P.O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248, 2011.

A. Shojaie and G. Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, 2010.

T. Silander and P. Myllymaki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 445–452, Arlington, VA, 2006. AUAI Press.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, volume 81. The MIT Press, 2000.

D.J. Stekhoven, I. Moraes, G. Sveinbjörnsson, L. Hennig, M.H. Maathuis, and P. Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.

S. van de Geer and P. Bühlmann. ℓ_0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y.C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.