High-dimensional learning of linear causal networks via inverse covariance estimation
Po-Ling Loh [email protected]
Department of Statistics, University of California, Berkeley, CA 94720, USA
Peter Bühlmann [email protected]
Seminar für Statistik, ETH Zürich, Switzerland
Abstract
We establish a new framework for statistical estimation of directed acyclic graphs (DAGs) when data are generated from a linear, possibly non-Gaussian structural equation model. Our framework consists of two parts: (1) inferring the moralized graph from the support of the inverse covariance matrix; and (2) selecting the best-scoring graph among DAGs that are consistent with the moralized graph. We show that when the error variances are known or estimated to close enough precision, the true DAG is the unique minimizer of the score computed using the reweighted squared ℓ_2-loss. Our population-level results have implications for the identifiability of linear SEMs when the error covariances are specified up to a constant multiple. On the statistical side, we establish rigorous conditions for high-dimensional consistency of our two-part algorithm, defined in terms of a "gap" between the true DAG and the next best candidate. Finally, we demonstrate that dynamic programming may be used to select the optimal DAG in linear time when the treewidth of the moralized graph is bounded.
Keywords: causal inference, dynamic programming, identifiability, inverse covariance matrix estimation, linear structural equation models
1. Introduction
Causal networks arise naturally in a wide variety of application domains, including genetics, epidemiology, and time series analysis (Hughes et al., 2000; Stekhoven et al., 2012; Aalen et al., 2012). The task of inferring the graph structure of a causal network from joint observations is a relevant but challenging problem. Whereas undirected graphical structures may be estimated via pairwise conditional independence testing, with worst-case time scaling as the square of the number of nodes, estimation methods for directed acyclic graphs (DAGs) first require learning an appropriate permutation order of the vertices, leading to computational complexity that scales exponentially in the graph size. Greedy algorithms present an attractive, computationally efficient alternative, but such methods are not generally guaranteed to produce the correct graph (Chickering, 2002). In contrast, exact methods for causal inference that search exhaustively over the entire DAG space may only be tractable for relatively small graphs (Silander and Myllymaki, 2006).

In practice, prior knowledge about the structure of the underlying DAG may lead to vast computational savings. For example, if a natural ordering of the nodes is known, inference may be performed by regressing each node upon its predecessors and selecting the best functional fit for each node. This yields an algorithm with runtime linear in the number of nodes and overall quadratic complexity. In the linear high-dimensional Gaussian setting, one could apply a version of the graphical Lasso, where the feasible set is restricted to matrices that are upper-triangular with respect to the known ordering (Shojaie and Michailidis, 2010). However, knowing the node order seems quite unrealistic in practice, except for certain special cases. If instead a conditional independence graph or a superset of the skeleton is specified a priori, the number of required conditional independence tests may also be reduced dramatically. This appears to be a more reasonable assumption, and various authors have devised algorithms to compute the optimal DAG efficiently in settings where the input graph has bounded degree and/or bounded treewidth (Perrier et al., 2008; Ordyniak and Szeider, 2012; Korhonen and Parviainen, 2013).

Unfortunately, appropriate tools for inferring such superstructures are rather limited, and the usual method of using the graphical Lasso to estimate a conditional independence graph is rigorously justified only in the linear Gaussian setting (Yuan and Lin, 2007). Recent results have established that a version of the graphical Lasso may also be used to learn a conditional independence graph for variables taking values in a discrete alphabet when the graph has bounded treewidth (Loh and Wainwright, 2013), but results for more general distributions are absent from the literature. Bühlmann et al. (2013) isolate sufficient conditions under which Lasso-based linear regression could be used to recover a conditional independence graph for general distributions, and use it as a prescreening step for nonparametric causal inference in additive noise models; however, it is unclear which non-Gaussian distributions satisfy such conditions.
We propose a new algorithmic strategy for inferring the DAG structure of a linear, potentially non-Gaussian structural equation model (SEM). Deviating slightly from the literature, we use the term non-Gaussian to refer to the fact that the variables are not jointly Gaussian; however, we do not require non-Gaussianity of all exogenous noise variables, as assumed by Shimizu et al. (2011). We proceed in two steps, where each step is of independent interest: First, we infer the moralized graph by estimating the inverse covariance matrix of the joint distribution. The novelty is that we justify this approach for non-Gaussian linear SEMs. Second, we find the optimal causal network structure by searching over the space of DAGs that are consistent with the moralized graph and selecting the DAG that minimizes an appropriate score function. When the score function is decomposable and the moralized graph has bounded treewidth, the second step may be performed via dynamic programming in time linear in the number of nodes (Ordyniak and Szeider, 2012). Our algorithm is also applicable in a high-dimensional setting when the moralized graph is sparse, where we estimate the support of the inverse covariance matrix using a method such as the graphical Lasso (Ravikumar et al., 2011). Our algorithmic framework is summarized in Algorithm 1:
Algorithm 1: Framework for DAG estimation

Input: Data samples {x_i}_{i=1}^n from a linear SEM.
1. Obtain an estimate Θ̂ of the inverse covariance matrix (e.g., using the graphical Lasso).
2. Construct the moralized graph M̂ with edge set defined by supp(Θ̂).
3. Compute scores for DAGs that are consistent with M̂ (e.g., using the squared ℓ_2-error).
4. Find the minimal-scoring Ĝ (using dynamic programming when the score is decomposable and M̂ has bounded treewidth).
Output: Estimated DAG Ĝ.
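To make the pipeline concrete, here is a minimal end-to-end sketch in Python for a small, fully observed problem; it is an illustration under our own choices, not the paper's implementation. Step 1 uses the simple low-dimensional estimator Θ̂ = Σ̂^{-1} followed by thresholding, and step 2 replaces dynamic programming with a brute-force search over topological orderings restricted to the estimated moralized graph. The edge weights, threshold level, and Laplace noise are arbitrary choices for the example.

# Minimal sketch of Algorithm 1 for a small, fully observed problem (illustrative only).
import itertools
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 5000

# True SEM: X = B0^T X + eps with strictly upper-triangular B0.
B0 = np.zeros((p, p))
B0[0, 2], B0[1, 2], B0[2, 3] = 0.8, -0.6, 0.7
Omega0 = np.diag([1.0, 1.0, 1.0, 1.0])                           # homoscedastic errors
eps = rng.laplace(size=(n, p)) * np.sqrt(np.diag(Omega0) / 2.0)  # non-Gaussian noise
X = eps @ np.linalg.inv(np.eye(p) - B0)                          # rows satisfy x^T = eps^T (I - B0)^{-1}

# Step 1: moralized graph from the support of the inverse covariance matrix.
Sigma_hat = X.T @ X / n
Theta_hat = np.linalg.inv(Sigma_hat)
thresh = 0.1
moral = (np.abs(Theta_hat) > thresh) & ~np.eye(p, dtype=bool)

# Step 2: score orderings of DAGs consistent with the moralized graph (brute force).
def score_ordering(order):
    # Sum of residual variances when each node is regressed on its preceding
    # neighbors in the moralized graph (unweighted squared error).
    total = 0.0
    for i, j in enumerate(order):
        S = [k for k in order[:i] if moral[j, k]]
        if S:
            beta = np.linalg.lstsq(X[:, S], X[:, j], rcond=None)[0]
            resid = X[:, j] - X[:, S] @ beta
        else:
            resid = X[:, j]
        total += np.mean(resid ** 2)
    return total

best = min(itertools.permutations(range(p)), key=score_ordering)
print("best ordering:", best, "score:", round(score_ordering(best), 3))

For larger graphs, the brute-force search over orderings would be replaced by the dynamic program of Section 6.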
We prove the correctness of our graph estimation algorithm by deriving new results about the theory of linear SEMs. We present a novel result showing that for almost every choice of linear coefficients, the support of the inverse covariance matrix of the joint distribution is identical to the edge structure of the moralized graph. Although a similar relationship between the support of the inverse covariance matrix and the edge structure of an undirected conditional independence graph has long been established for multivariate Gaussian models (Lauritzen, 1996), our result does not involve any assumptions of Gaussianity, and the proof technique is completely new.

Since we do not impose constraints on the error distribution of our SEM, standard parametric maximum likelihood methods are not applicable to score and compare candidate DAGs. Consequently, we use the squared ℓ_2-error to score DAGs, and prove that in the case of homoscedastic errors, the true DAG uniquely minimizes this score function. As a side corollary, we establish that the DAG structure of a linear SEM is identifiable whenever the additive errors are homoscedastic, which generalizes a recent result derived only for Gaussian variables (Peters and Bühlmann, 2013). In addition, our result covers cases with both Gaussian and non-Gaussian errors, whereas Shimizu et al. (2011) require all errors to be non-Gaussian (see Section 4.2). A similar result is implicitly contained under some assumptions in van de Geer and Bühlmann (2013), but we provide a more general statement and additionally quantify a regime where the errors may exhibit a certain degree of heteroscedasticity. Thus, when errors are not too heteroscedastic, the much more complicated ICA algorithm (Shimizu et al., 2006, 2011) may be replaced by a simple scoring method using the squared ℓ_2-loss.

On the statistical side, we show that our method produces consistent estimates of the true DAG by invoking results from high-dimensional statistics. We note that our theoretical results only require a condition on the gap between squared ℓ_2-scores for various DAGs in the restricted search space and eigenvalue conditions on the true covariance matrix, which is much weaker than the restrictive beta-min condition from previous work (van de Geer and Bühlmann, 2013). Furthermore, the size of the gap is not required to scale linearly with the number of nodes in the graph, unlike similar conditions in van de Geer and Bühlmann (2013) and Peters and Bühlmann (2013), leading to genuinely high-dimensional results. Although the precise size of the gap relies heavily on the structure of the true DAG, we include several examples giving intuition for when our condition could be expected to hold (see Sections 4.4 and 5.2 below). Finally, since inverse covariance matrix estimation and computing scores based on linear regression are both easily modified to deal with systematically corrupted data (Loh and Wainwright, 2012), we show that our methods are also applicable for learning the DAG structure of a linear SEM when data are observed subject to corruptions such as missing data and additive noise.

The remainder of the paper is organized as follows: In Section 2, we review the general theory of probabilistic graphical models and linear SEMs. Section 3 describes our results on the relationship between the inverse covariance matrix and the conditional independence graph of a linear SEM. In Section 4, we discuss the use of the squared ℓ_2-loss for scoring candidate DAGs.
Section 5 establishes results for the statistical consistency of our proposed inference algorithms and explores the gap condition for various graphs. Finally, Section 6 describes how dynamic programming may be used to identify the optimal DAG in linear time when the moralized graph has bounded treewidth. Proofs of supporting results are contained in the Appendix.
2. Background
We begin by reviewing some basic background material and introducing notation for the graph estimation problems studied in this paper.
In this section, we briefly review the theory of directed and undirected graphical models, the latter also known as conditional independence graphs (CIGs). For a more in-depth exposition, see Lauritzen (1996) or Koller and Friedman (2009) and the references cited therein.
Consider a probability distribution q(x_1, …, x_p) and an undirected graph G = (V, E), where V = {1, …, p} and E ⊆ V × V. We say that G is a conditional independence graph (CIG) for q if the following Markov condition holds: For all disjoint triples (A, B, S) ⊆ V such that S separates A from B in G, we have X_A ⊥⊥ X_B | X_S. Here, X_C := {X_j : j ∈ C} for any subset C ⊆ V. We also say that G represents the distribution q.

By the well-known Hammersley-Clifford theorem, if q is a strictly positive distribution (i.e., q(x_1, …, x_p) > 0 for all (x_1, …, x_p)), then G is a CIG for q if and only if we may write

q(x_1, …, x_p) = ∏_{C ∈ 𝒞} ψ_C(x_C),   (1)

for some potential functions {ψ_C : C ∈ 𝒞} defined over the set of cliques 𝒞 of G. In particular, note that the complete graph on p nodes is always a CIG for q, but CIGs with fewer edges often exist.

Changing notation slightly, consider a directed graph G = (V, E), where we now distinguish between edges (j, k) and (k, j). We say that G is a directed acyclic graph (DAG) if there are no directed paths starting and ending at the same node. For each node j ∈ V, let Pa(j) := {k ∈ V : (k, j) ∈ E} denote the parent set of j, where we sometimes write Pa_G(j) to emphasize the dependence on G. A DAG G represents a distribution q(x_1, …, x_p) if q factorizes as

q(x_1, …, x_p) ∝ ∏_{j=1}^p q(x_j | x_{Pa(j)}).   (2)

Finally, a permutation π of the vertex set V = {1, …, p} is a topological order for G if π(j) < π(k) whenever (j, k) ∈ E. Such a topological order exists for any DAG, but it may not be unique. The factorization (2) implies that X_j ⊥⊥ X_{ν(j)} | X_{Pa(j)} for all j, where ν(j) denotes the set of all nondescendants of j, excluding its parents.

Given a DAG G, we may form the moralized graph M(G) by fully connecting all nodes within each parent set Pa(j) and dropping the orientations of directed edges. Note that moralization is a purely graph-theoretic operation that transforms a directed graph into an undirected graph. However, if the DAG G represents a distribution q, then M(G) is also a CIG for q. This is because each set {j} ∪ Pa(j) forms a clique C_j in M(G), and we may define the potential functions ψ_{C_j}(x_{C_j}) := q(x_j | x_{Pa(j)}) to obtain the factorization (1) from the factorization (2).

Finally, we define the skeleton of a DAG G to be the undirected graph formed by dropping the orientations of edges in G. Note that the edge set of the skeleton is a subset of the edge set of the moralized graph, but the latter set is generally much larger. The skeleton is not in general a CIG.

We now specialize to the framework of linear structural equation models. We say that a random vector X = (X_1, …, X_p) ∈ R^p follows a linear structural equation model (SEM) if

X = B^T X + ε,   (3)

where B is a strictly upper triangular matrix known as the autoregression matrix. We assume E[X] = E[ε] = 0 and ε_j ⊥⊥ (X_1, …, X_{j−1}) for all j.

In particular, observe that the DAG G with vertex set V = {1, …, p} and edge set E = {(j, k) : B_{jk} ≠ 0} represents the joint distribution q on X. Indeed, equation (3) implies that

q(X_j | X_1, …, X_{j−1}) = q(X_j | X_{Pa_G(j)}),

so we may factorize

q(X_1, …, X_p) = ∏_{j=1}^p q(X_j | X_1, …, X_{j−1}) = ∏_{j=1}^p q(X_j | X_{Pa_G(j)}).

Given samples {x_i}_{i=1}^n, our goal is to infer the unknown matrix B, from which we may recover G (or vice versa).
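Moralization itself is a purely combinatorial operation. The short Python sketch below (with a hypothetical example DAG chosen by us) makes the construction explicit: the parents of each node are married and the orientations are dropped.

# Moralization of a DAG: marry parents of each node, then drop orientations.
from itertools import combinations

def moralize(p, dag_edges):
    # Return the edge set of the moralized graph M(G) as a set of frozensets.
    parents = {j: set() for j in range(p)}
    for (j, k) in dag_edges:              # directed edge j -> k
        parents[k].add(j)
    undirected = {frozenset(e) for e in dag_edges}
    for k in range(p):                    # connect all pairs of parents of k
        for a, b in combinations(parents[k], 2):
            undirected.add(frozenset((a, b)))
    return undirected

# v-structure 0 -> 2 <- 1, plus the edge 2 -> 3:
print(sorted(tuple(sorted(e)) for e in moralize(4, [(0, 2), (1, 2), (2, 3)])))
# The moral edge {0, 1} appears even though (0, 1) is not an edge of the DAG.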
3. Moralized graphs and inverse covariance matrices
In this section, we describe our main result concerning inverse covariance matrices of linear SEMs. It generalizes a result for multivariate Gaussians, and states that the inverse covariance matrix of the joint distribution of a linear SEM reflects the structure of a conditional independence graph.

We begin by noting that E[X_j | X_1, …, X_{j−1}] = b_j^T X, where b_j is the j-th column of B, and

b_j = ( Σ_{j, 1:j−1} (Σ_{1:j−1, 1:j−1})^{−1}, 0, …, 0 )^T.

Here, Σ := cov[X]. We call b_j^T X the best linear predictor for X_j among linear combinations of {X_1, …, X_{j−1}}. Defining Ω := cov[ε] and Θ := Σ^{−1}, we then have the following lemma, proved in Appendix C.1:

Lemma 1
The matrix of error covariances is diagonal:
Ω = diag(σ_1^2, …, σ_p^2) for some σ_j^2 > 0. The entries of Θ are given by

Θ_{jk} = −σ_k^{−2} B_{jk} + ∑_{ℓ>k} σ_ℓ^{−2} B_{jℓ} B_{kℓ},   for all j < k,   (4)

Θ_{jj} = σ_j^{−2} + ∑_{ℓ>j} σ_ℓ^{−2} B_{jℓ}^2,   for all j.   (5)

In particular, equation (4) has an important implication for causal inference, which we state in the following theorem. Recalling the notation of Section 2.1.2, the graph M(G) denotes the moralized DAG.

Theorem 2
Suppose X is generated from the linear structural equation model (3). Then Θ reflects the graph structure of the moralized DAG; i.e., for j ≠ k, we have Θ_{jk} = 0 if (j, k) is not an edge in M(G).

Proof
Suppose j ≠ k and (j, k) is not an edge in M(G), and assume without loss of generality that j < k. Certainly, (j, k) ∉ E, implying that B_{jk} = 0. Furthermore, j and k cannot share a common child, or else (j, k) would be an edge in M(G). This implies that either B_{jℓ} = 0 or B_{kℓ} = 0 for each ℓ > k. The desired result then follows from equation (4).

In the results to follow, we will assume that the converse of Theorem 2 holds as well. This is stated in the following assumption:

Assumption 1
Let (B, Ω) be the matrices of the underlying linear SEM. For every j < k, we have

−σ_k^{−2} B_{jk} + ∑_{ℓ>k} σ_ℓ^{−2} B_{jℓ} B_{kℓ} = 0

only if B_{jk} = 0 and B_{jℓ} B_{kℓ} = 0 for all ℓ > k.

Combined with Theorem 2, Assumption 1 implies that Θ_{jk} = 0 if and only if (j, k) is not an edge in M(G). (Since Θ ≻ 0, the diagonal entries of Θ are always strictly positive.) Note that when the nonzero entries of B are drawn independently from continuous distributions, Assumption 1 holds for all choices of B except on a set of Lebesgue measure zero.
Remark 3

Note that Theorem 2 may be viewed as an extension of the canonical result for Gaussian graphical models. Indeed, a multivariate Gaussian distribution may be written as a linear SEM with respect to any permutation order π of the variables, giving rise to a DAG G_π. Theorem 2 states that supp(Θ) is always a subset of the edge set of M(G_π). Assumption 1 is a type of faithfulness assumption (Koller and Friedman, 2009; Spirtes et al., 2000). By selecting different topological orders π, one may then derive the usual result that X_j ⊥⊥ X_k | X_{V∖{j,k}} if and only if Θ_{jk} = 0 in the Gaussian setting. Note that this conditional independence assertion may not always hold for linear SEMs, however, since non-Gaussian distributions are not necessarily expressible as a linear SEM with respect to an arbitrary permutation order. Indeed, we only require Assumption 1 to hold with respect to a single (fixed) order.
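The following small numerical check (ours, not taken from the paper) illustrates Lemma 1 and Theorem 2: for a chosen strictly upper-triangular B and diagonal Ω, the matrix Θ = (I − B) Ω^{−1} (I − B)^T has support contained in the edge set of M(G), including the moral edge between parents that share a child. The particular values of B and Ω are arbitrary.

# Numerical check of Lemma 1 / Theorem 2 for a v-structure 0 -> 2 <- 1 plus 2 -> 3.
import numpy as np

p = 4
B = np.zeros((p, p))
B[0, 2], B[1, 2], B[2, 3] = 1.5, -2.0, 0.5
Omega = np.diag([1.0, 2.0, 0.5, 1.0])

Sigma = np.linalg.inv(np.eye(p) - B).T @ Omega @ np.linalg.inv(np.eye(p) - B)
Theta = np.linalg.inv(Sigma)

print(np.round(Theta, 3))
# Theta[0, 1] != 0 (moral edge from the common child 2), while Theta[0, 3] = Theta[1, 3] = 0,
# in agreement with equation (4): Theta_jk = -sigma_k^{-2} B_jk + sum_{l>k} sigma_l^{-2} B_jl B_kl.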
4. Score functions for DAGs
Having established a method for reducing the search space of DAGs based on estimating the moralized graph, we now move to the more general problem of scoring candidate DAGs. As before, we assume the setting of a linear SEM.

Parametric maximum likelihood is often used as a score function for statistical estimation of DAG structure, since it enjoys the nice property that the population-level version is maximized only under a correct parametrization of the model class. This follows from the relationship between maximum likelihood and KL divergence:

arg max_θ E_{θ_0}[log p_θ(X)] = arg min_θ E_{θ_0}[ log( p_{θ_0}(X) / p_θ(X) ) ] = arg min_θ D_KL( p_{θ_0} ‖ p_θ ),

and the latter quantity is minimized exactly when p_θ ≡ p_{θ_0}, almost everywhere. If the model is identifiable, this happens only if θ = θ_0.

However, such maximum likelihood methods presuppose a fixed parametrization for the model class. In the case of linear SEMs, this translates into an appropriate parametrization of the error vector ε. For comparison, note that minimizing the squared ℓ_2-error for ordinary linear regression may be viewed as a maximum likelihood approach when errors are Gaussian, but the ℓ_2-minimizer is still statistically consistent for estimation of the regression vector when errors are not Gaussian. When our goal is recovery of the autoregression matrix B of the DAG, it is therefore natural to ask whether the squared ℓ_2-error could be used in place of maximum likelihood as an appropriate metric for evaluating DAGs.

We will show that in settings where the noise variances {σ_j^2}_{j=1}^p are specified up to a constant (e.g., homoscedastic errors), the answer is affirmative. In such cases, the true DAG uniquely minimizes the ℓ_2-loss. As a side result, we also show that the true linear SEM is identifiable.

Remark 4
Nowzohour and Bühlmann (2013) study the use of nonparametric maximum likelihood methods for scoring candidate DAGs. We remark that such methods could also be combined with the framework of Sections 3 and 6 to select the optimal DAG for linear SEMs with nonparametric error distributions: First, estimate the moralized graph via the inverse covariance matrix, and then find the DAG with minimal score using a method such as dynamic programming. Similar statistical guarantees would hold in that case, with parametric rates replaced by nonparametric rates. However, our results in this section imply that in settings where the error variances are known or may be estimated accurately, the much simpler method of squared ℓ_2-loss may be used in place of a more complicated nonparametric approach.

Suppose X is drawn from a linear SEM (3), where we now use B_0 to denote the true autoregression matrix and Ω_0 to denote the true error covariance matrix. For a fixed diagonal matrix Ω = diag(σ_1^2, …, σ_p^2) and a candidate matrix B with columns {b_j}_{j=1}^p, define the score of B with respect to Ω according to

score_Ω(B) = E[ ‖Ω^{−1/2} (I − B)^T X‖_2^2 ] = ∑_{j=1}^p (1/σ_j^2) · E[(X_j − b_j^T X)^2].   (6)

This is a weighted squared ℓ_2-loss, where the prediction error for the j-th coordinate is weighted by the diagonal entry σ_j^2 coming from Ω, and expectations are taken with respect to the true distribution on X.

It is instructive to compare the score function (6) to the usual parametric maximum likelihood when X ∼ N(0, Σ). For a distribution parametrized by the pair (B, Ω), the inverse covariance matrix is Θ = (I − B) Ω^{−1} (I − B)^T, so the expected log likelihood is

E_{X ∼ N(0,Σ)}[log p_{B,Ω}(X)] = −(1/2) tr[(I − B) Ω^{−1} (I − B)^T Σ] + (1/2) log det[(I − B) Ω^{−1} (I − B)^T] + c
 = −(1/2) tr[(I − B) Ω^{−1} (I − B)^T Σ] + (1/2) log det(Ω^{−1}) + c
 = −(1/2) score_Ω(B) + (1/2) log det(Ω^{−1}) + c,

where c is a constant not depending on (B, Ω). Hence, minimizing the score over B for a fixed Ω is identical to maximizing the likelihood. For non-Gaussian distributions, however, the convenient relationship between minimum score and maximum likelihood no longer holds.

Now let D denote the class of DAGs. For G ∈ D, define the score of G to be

score_Ω(G) := min_{B ∈ U_G} { score_Ω(B) },   (7)

where

U_G := { B ∈ R^{p×p} : B_{jk} = 0 when (j, k) ∉ E(G) }

is the set of matrices that are consistent with the structure of G.

Remark 5
Examining the form of the score function (6), we see that if {Pa_G(j)}_{j=1}^p denotes the parent sets of nodes in G, then the matrix

B_G := arg min_{B ∈ U_G} { score_Ω(B) }

is unique, and the columns of B_G are equal to the coefficients of the best linear predictor of X_j regressed upon X_{Pa_G(j)}. Furthermore, the value of B_G does not depend on Ω.

The following lemma relates the score of the underlying DAG G_0 to the score of the true autoregression matrix B_0. In fact, any DAG containing G_0 has the same score. The proof is contained in Appendix C.2.

Lemma 6
Suppose X follows a linear SEM with autoregression matrix B_0, and let G_0 denote the underlying DAG. Consider any G ∈ D such that G_0 ⊆ G. Then for any diagonal weight matrix Ω, we have score_Ω(G) = score_Ω(B_0), and B_0 is the unique minimizer of score_Ω(B) over U_G.

We now turn to the main theorem of this section, in which we consider the problem of minimizing score_Ω(B) with respect to all matrices B that are permutation similar to upper triangular matrices. Such a result is needed to validate our choice of score function, since when the DAG structure is not known a priori, the space of possible autoregression matrices must include all of U := ∪_{G ∈ D} U_G. Note that U may be equivalently defined as the set of all matrices that are permutation similar to upper triangular matrices. We have the following vital result:

Theorem 7
Given a linear SEM (3) with error covariance matrix αΩ_0 and autoregression matrix B_0, where α > 0, we have

score_{αΩ_0}(B) ≥ score_{αΩ_0}(B_0) = p/α,   for all B ∈ U,   (8)

with equality if and only if B = B_0.

The proof of Theorem 7, which is based on matrix algebra, is contained in Section 4.5. In particular, Theorem 7 implies that the squared ℓ_2-loss function (6) is indeed an appropriate measure of model fit when the components are correctly weighted by the diagonal entries of Ω_0.

Note, however, that Theorem 7 requires the score to be taken with respect to (a multiple of) the true error covariance matrix Ω_0. The following example gives a cautionary message: if the weights Ω are chosen incorrectly, minimizing score_Ω(B) may produce a structure that is inconsistent with the true model.

Example 1
Suppose (X_1, X_2) is distributed according to the following linear SEM: X_1 = ε_1 and X_2 = −(1/2) X_1 + ε_2, so the autoregression matrix is given by

B_0 = ( 0  −1/2 ; 0  0 ).

Let Ω_0 = diag(1, 1/4). Consider

B_1 = ( 0  0 ; −1  0 ).

Then

score_I(B_1) < score_I(B_0),   (9)

so using the squared ℓ_2-loss weighted by the identity will select an inappropriate model.

Proof
To verify equation (9), we first compute

Σ = (I − B_0)^{−T} Ω_0 (I − B_0)^{−1} = ( 1  −1/2 ; −1/2  1/2 ).

Then

E[‖X − B_1^T X‖_2^2] = tr[(I − B_1)^T Σ (I − B_1)] = tr( 1/2  0 ; 0  1/2 ) = 1,
E[‖X − B_0^T X‖_2^2] = tr[(I − B_0)^T Σ (I − B_0)] = tr( 1  0 ; 0  1/4 ) = 5/4,

implying inequality (9).
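For completeness, the computation above can be verified numerically using the matrices displayed in Example 1; this quick Python check is ours and not part of the original text.

# Numerical check of Example 1: with identity weights, the misspecified B1 scores below B0.
import numpy as np

B0 = np.array([[0.0, -0.5], [0.0, 0.0]])
B1 = np.array([[0.0, 0.0], [-1.0, 0.0]])
Omega0 = np.diag([1.0, 0.25])
I = np.eye(2)

Sigma = np.linalg.inv(I - B0).T @ Omega0 @ np.linalg.inv(I - B0)

def score_identity(B):
    # score_I(B) = E ||X - B^T X||_2^2 = tr[(I - B)^T Sigma (I - B)]
    return np.trace((I - B).T @ Sigma @ (I - B))

print(score_identity(B1), "<", score_identity(B0))   # 1.0 < 1.25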
Theorem 7 also has a useful consequence in terms of identifiability of a linear SEM, which we state in the following corollary:

Corollary 8
Consider a fixed diagonal covariance matrix Ω_0, and consider the class of linear SEMs parametrized by the pair (B, αΩ_0), where B ∈ U and α > 0 is a scale factor. Then the true model (B_0, α_0 Ω_0) is identifiable. In particular, the class of homoscedastic linear SEMs is identifiable.

Proof
By Theorem 7, the matrix B_0 is the unique minimizer of score_{Ω_0}(B). Since α_0 · (Ω_0)_{11} = var[X_1], the scale factor α_0 is also uniquely identifiable. The statement about homoscedasticity follows by taking Ω_0 = I.

Corollary 8 should be viewed in comparison to previous results in the literature regarding identifiability of linear SEMs. Theorem 1 of Peters and Bühlmann (2013) states that when X is Gaussian and ε is an i.i.d. Gaussian vector with cov[ε] = αΩ_0, the model is identifiable. Indeed, our Corollary 8 implies that result as a special case, but it does not impose any additional conditions concerning Gaussianity. Shimizu et al. (2006) establish identifiability of a linear SEM when ε is a vector of independent, non-Gaussian errors, by reducing to ICA, but our result does not require errors to be non-Gaussian.

The significance of Corollary 8 is that it supplies an elegant proof showing that the model is still identifiable even in the presence of both Gaussian and non-Gaussian components, provided the error variances are specified up to a scalar multiple. Since any multivariate Gaussian distribution may be written as a linear SEM with respect to an arbitrary ordering, some constraint such as variance scaling or non-Gaussianity is necessary in order to guarantee identifiability.

Theorem 7 implies that when the diagonal variances of Ω_0 are known up to a scalar factor, the weighted ℓ_2-loss (6) may be used as a score function for linear SEMs. Example 1 shows that when Ω is misspecified, we may have B_0 ∉ arg min_{B ∈ U} {score_Ω(B)}. In this section, we further study the effect when Ω is misspecified. Intuitively, provided Ω is close enough to Ω_0 (or a multiple thereof), minimizing score_Ω(B) with respect to B should still yield the correct B_0.

Consider an arbitrary diagonal weight matrix Ω. We first provide bounds on the ratio between entries of Ω and Ω_0 which ensure that B_0 = arg min_{B ∈ U} {score_Ω(B)}, even though the model is misspecified. Let

a_max := λ_max(Ω_0 Ω^{−1}),   and   a_min := λ_min(Ω_0 Ω^{−1}),

denote the maximum and minimum ratios between corresponding diagonal entries of Ω_0 and Ω. Now define the additive gap between the score of G_0 and the next best DAG, given by

ξ := min_{G ∈ D, G ⊉ G_0} { score_{Ω_0}(G) − score_{Ω_0}(G_0) } = min_{G ∈ D, G ⊉ G_0} { score_{Ω_0}(G) } − p.   (10)

By Theorem 7, we know that ξ > 0. The following theorem provides a sufficient condition for correct model selection in terms of the gap ξ and the ratio a_max/a_min, which are both invariant to the scale factor α. It is a measure of robustness for how roughly the entries of Ω_0 may be approximated and still produce B_0 as the unique minimizer. The proof of the theorem is contained in Appendix C.3.

Theorem 9
Suppose

a_max / a_min ≤ 1 + ξ/p.   (11)

Then B_0 ∈ arg min_{B ∈ U} {score_Ω(B)}. If inequality (11) is strict, then B_0 is the unique minimizer of score_Ω(B).

Remark 10
Theorem 9 provides an error allowance concerning the accuracy to which we may specify the error covariances and still recover the correct autoregression matrix B_0 from an improperly weighted score function. In the case Ω = αΩ_0, we have a_max = a_min, so the condition (11) is always strictly satisfied, which is consistent with our earlier result in Theorem 7 that B_0 = arg min_{B ∈ U} {score_{αΩ_0}(B)}.

Naturally, the error tolerance specified by Theorem 9 is a function of the gap ξ between the true DAG G_0 and the next best candidate: If ξ is larger, the score is more robust to misspecification of the weights Ω. Note that if we restrict our search space from the full set of DAGs D to some smaller space D′, so B ∈ ∪_{G ∈ D′} U_G, we may restate the condition in Theorem 9 in terms of the gap

ξ(D′) := min_{G ∈ D′, G ⊉ G_0} { score_{Ω_0}(G) − score_{Ω_0}(G_0) },   (12)

which may be considerably larger than ξ when D′ is much smaller than D (cf. equation (21) and Section 5.2 on weakening the gap condition below).

Specializing to the case Ω = I, we may interpret Theorem 9 as providing a window of variances within which we may treat a heteroscedastic model as homoscedastic and use the simple (unweighted) squared ℓ_2-score to recover the correct model. See Lemma 11 in the next section for a concrete example.

In this section, we develop examples illustrating the gap ξ introduced in Section 4.3. We study two cases, involving two and three variables, respectively.
Figure 1: Two-variable DAG. (a) The forward model. (b) The backward model.
We first consider the simplest example of a two-variable directed graph. Suppose the forward model is defined by

B_0 = ( 0  b_0 ; 0  0 ),   Ω_0 = ( d_1^2  0 ; 0  d_2^2 ),

and consider the backward model defined by the autoregression matrix

B = ( 0  0 ; b  0 ).   (13)

The forward and backward models are illustrated in Figure 1. A straightforward calculation shows that

score_{Ω_0}(B) = 2 + b^2 b_0^2 + ( b_0 d_1/d_2 − b d_2/d_1 )^2,

which is minimized for b = b_0 / (b_0^2 + d_2^2/d_1^2), implying that

ξ = min_b { score_{Ω_0}(B) − score_{Ω_0}(B_0) } = b_0^4 d_1^4 / (d_2^4 + b_0^2 d_1^2 d_2^2).   (14)

We see that the gap ξ grows with the strength of the true edge b_0 when |b_0| > 1, and is symmetric with respect to the sign of b_0. The gap also grows with the magnitude of the ratio d_1/d_2.

To gain intuition for the interplay between b_0 and d_1/d_2, we derive the following lemma, a corollary of Theorem 9 specialized to the case of the two-variable DAG:

Lemma 11
Consider the two-variable DAG defined by equation (13). Let Ω = I and define r := d_2^2 / d_1^2. Suppose the following condition holds:

b_0^2 ≥ r( (r − 1) + √(r^2 − 1) ),   if r ≥ 1,
b_0^2 ≥ (1 − r) + √(1 − r^2),   if r ≤ 1.   (15)

Then B_0 = arg min_{B ∈ U} {score_Ω(B)}; i.e., B_0 is the unique minimizer of the score function under the unweighted squared-ℓ_2 loss.

Lemma 11 is proved in Appendix C.4.
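A quick numerical sanity check of equation (14), not taken from the paper, is to minimize score_{Ω_0}(B) over the backward coefficient b on a grid and compare against the closed-form gap; the particular values of b_0, d_1, d_2 below are arbitrary.

# Sanity check of the two-variable gap formula (14) by grid minimization over b.
import numpy as np

b0, d1, d2 = 1.3, 1.0, 0.7
Sigma = np.array([[d1**2, b0 * d1**2],
                  [b0 * d1**2, b0**2 * d1**2 + d2**2]])

def backward_score(b):
    # score_{Omega_0}(B) for the backward model (X_1 regressed on X_2), weights 1/d_1^2, 1/d_2^2.
    e1 = Sigma[0, 0] - 2 * b * Sigma[0, 1] + b**2 * Sigma[1, 1]
    return e1 / d1**2 + Sigma[1, 1] / d2**2

grid = np.linspace(-5, 5, 200001)
xi_numeric = backward_score(grid).min() - 2.0        # score of the true DAG is p = 2
xi_formula = b0**4 * d1**4 / (d2**4 + b0**2 * d1**2 * d2**2)
print(round(xi_numeric, 4), round(xi_formula, 4))    # the two values agree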
Figure 2: Three-variable DAG with v-structure.
Remark 12

Note that the two right-hand expressions in inequality (15) are similar, although the expression in the case r ≥ 1 contains an extra factor of r, so the sufficient condition is stronger. Both lower bounds in equation (15) increase with |r − 1|, which agrees with intuition: If the true model is more non-homoscedastic, the strength of the true edge must be stronger in order for the unweighted squared-ℓ_2 score to correctly identify the model. When r = 1, we have the vacuous condition b_0^2 ≥ 0, since Ω = αΩ_0 and the variances are correctly specified, so Theorem 7 implies B_0 = arg min_{B ∈ U} {score_Ω(B)} for any choice of b_0.

We now examine a three-variable graph. Suppose the actual graph involves a v-structure, as depicted in Figure 2, and is parametrized by the matrices

B_0 = ( 0  0  b_1 ; 0  0  b_2 ; 0  0  0 ),   Ω_0 = diag(d_1^2, d_2^2, d_3^2).   (16)

We have the following lemma, proved in Appendix C.5:

Lemma 13
Consider the three-variable DAG characterized by Figure 2 and equations (16). The gap ξ defined by equation (10) is given by

ξ = min{ b_1^4 d_1^4 / (d_3^4 + b_1^2 d_1^2 d_3^2),  b_2^4 d_2^4 / (d_3^4 + b_2^2 d_2^2 d_3^2) }.

A key reduction in the proof of Lemma 13 is to note that we only need to consider a relatively small number of DAGs, given by Figure 3 in Appendix C.5. Indeed, for G_1 ⊆ G_2, we have score_{Ω_0}(G_2) ≤ score_{Ω_0}(G_1), so it suffices to consider maximal elements in the poset of DAGs not containing the true DAG G_0.

Remark 14
Note that the form of the gap in Lemma 13 is very similar to the form for the two-variable model, and the individual ratios scale with the strength of the edge and the ratio of the corresponding error variances. Indeed, we could derive a version of Lemma 11 for the three-variable model, giving lower bounds on the edge strengths b_1 and b_2 that guarantee the accuracy of the unweighted squared ℓ_2-loss; however, the conditions would be more complicated. It is interesting to note from our calculations in Appendix C.5 that the gap between models accumulates according to the number of edge reversals from the misspecified model: Reversing the directions of edges (2, 3) and (1, 3) in succession leads to an additional term in the expressions for ξ_1 and ξ_2 in equations (44) and (45) below. We will revisit these observations in Section 5.2, where we define a version of the gap function rescaled by the number of nodes that differ in their parent sets.

Proof of Theorem 7

First note that for a constant α > 0, we have

score_{αΩ}(B) = (1/α) · score_Ω(B),

so minimizing score_{αΩ}(B) is equivalent to minimizing score_Ω(B). Hence, it is sufficient to prove the statement for α = 1.

Recalling equation (40), we may write

score_{Ω_0}(B) = E_{B_0}[ ‖Ω_0^{−1/2} (I − B)^T X‖_2^2 ]
 = tr[ Ω_0^{−1/2} (I − B)^T · cov_{B_0}[X] · (I − B) Ω_0^{−1/2} ]
 = tr[ Ω_0^{−1/2} (I − B)^T · (I − B_0)^{−T} Ω_0 (I − B_0)^{−1} · (I − B) Ω_0^{−1/2} ].

Now note that

(I − B) Ω_0^{−1/2} = Ω_0^{−1/2} (I − Ω_0^{1/2} B Ω_0^{−1/2}) := Ω_0^{−1/2} (I − B̃),
Ω_0^{1/2} (I − B_0)^{−1} = (I − Ω_0^{1/2} B_0 Ω_0^{−1/2})^{−1} Ω_0^{1/2} := (I − B̃_0)^{−1} Ω_0^{1/2},

where B̃, B̃_0 ∈ U. Hence, we may rewrite

score_{Ω_0}(B) = tr[ (I − B̃)^T Ω_0^{−1/2} · Ω_0^{1/2} (I − B̃_0)^{−T} · (I − B̃_0)^{−1} Ω_0^{1/2} · Ω_0^{−1/2} (I − B̃) ]
 = tr[ (I − B̃)^T (I − B̃_0)^{−T} (I − B̃_0)^{−1} (I − B̃) ]
 = tr[ (I − B̃)(I − B̃)^T (I − B̃_0)^{−T} (I − B̃_0)^{−1} ].

Since B̃, B̃_0 ∈ U, the matrices I − B̃ and I − B̃_0 are both permutation similar to lower triangular matrices with 1's on the diagonal. Hence, Lemma 27 in Appendix B implies

score_{Ω_0}(B) ≥ p,

with equality if and only if I − B̃ = I − B̃_0, or equivalently, B = B_0, as claimed.
5. Consequences for statistical estimation
The population-level results in Theorems 2 and 7 provide a natural avenue for estimating the DAG of a linear SEM from data. In this section, we outline how the true DAG may be estimated in the presence of fully observed or systematically corrupted data. Our method is also applicable in the high-dimensional setting, assuming the moralized DAG is sufficiently sparse. Our inference algorithm consists of two main components:
1. Estimate the moralized DAG M(G_0) using the inverse covariance matrix of X.
2. Search through the space of DAGs consistent with M(G_0), and find the DAG that minimizes score_Ω(B).

Theorem 2 and Assumption 1 ensure that for almost every choice of autoregression matrix B_0, the support of the true inverse covariance matrix Θ_0 exactly corresponds to the edge set of the moralized graph. Theorem 7 ensures that when the weight matrix Ω is chosen appropriately, B_0 will be the unique minimizer of score_Ω(B).

We now present concrete statistical guarantees for the correctness of our algorithm in the usual setting when {x_i}_{i=1}^n are fully observed and i.i.d. Recall that a random variable X is sub-Gaussian with parameter σ^2 if

E[exp(λ(X − E[X]))] ≤ exp(σ^2 λ^2 / 2),   for all λ ∈ R.

If X ∈ R^p is a random vector, it is sub-Gaussian with parameter σ^2 if v^T X is a sub-Gaussian random variable with parameter σ^2 for all unit vectors v ∈ R^p.

We first consider the problem of inferring Θ_0. Let

Θ_min^0 := min_{j,k} { |(Θ_0)_{jk}| : (Θ_0)_{jk} ≠ 0 }

denote the magnitude of the minimum nonzero element of Θ_0. We consider the following two scenarios:

Low-dimensional setting. If n ≥ p, the sample covariance matrix Σ̂ = (1/n) ∑_{i=1}^n x_i x_i^T is invertible, and we use the estimator Θ̂ = (Σ̂)^{−1}. We have the following lemma, which follows from standard bounds in random matrix theory:
Lemma 15
Suppose the x_i's are i.i.d. sub-Gaussian vectors with parameter σ^2. With probability at least 1 − c_1 exp(−c_2 p), we have

‖Θ̂ − Θ_0‖_max ≤ c_0 σ^2 √(p/n),

and thresholding Θ̂ at level τ = c_0 σ^2 √(p/n) succeeds in recovering supp(Θ_0), provided Θ_min^0 > 2τ.

For the proof, see Appendix D.1. Here, we use ‖·‖_max to denote the elementwise ℓ_∞-norm of a matrix.

High-dimensional setting. If p > n, we assume each row of the true inverse covariance matrix Θ_0 is d-sparse. Then we use the graphical Lasso:

Θ̂ ∈ arg min_{Θ ⪰ 0} { tr(Θ Σ̂) − log det(Θ) + λ ∑_{j≠k} |Θ_{jk}| }.   (17)

Standard results (Ravikumar et al., 2011) establish the statistical consistency of the graphical Lasso (17) as an estimator for the inverse covariance matrix in the setting of sub-Gaussian observations; consequently, we omit the proof of the following lemma.
Lemma 16

Suppose the x_i's are i.i.d. sub-Gaussian vectors with parameter σ^2, and suppose the sample size satisfies n ≥ Cd^2 log p. With probability at least 1 − c_1 exp(−c_2 log p), we have

‖Θ̂ − Θ_0‖_max ≤ c_0 σ^2 √(log p / n),

and thresholding Θ̂ at level τ = c_0 σ^2 √(log p / n) succeeds in recovering supp(Θ_0), provided Θ_min^0 > 2τ.

Alternatively, we may perform nodewise regression with the ordinary Lasso (Meinshausen and Bühlmann, 2006) to recover the support of Θ_0, with similar rates for statistical consistency.

Moving on to the second step of the algorithm, we need to estimate the score functions score_Ω(B) of candidate DAGs and choose the minimally scoring candidate. In this section, we focus on methods for estimating an empirical version of the score function and derive rates for statistical estimation under certain models. If the space of candidate DAGs is sufficiently small, we may evaluate the empirical score function for every candidate DAG and select the optimum. In Section 6, we describe computationally efficient procedures based on dynamic programming to choose the optimal DAG when the candidate space is too large for naive search.

The input of our algorithm is the sparsely estimated inverse covariance matrix Θ̂ from Section 5.1.1. For a matrix Θ, define the candidate neighborhood sets

N_Θ(j) := { k : k ≠ j and Θ_{jk} ≠ 0 },   for all j,

and let

D_Θ := { G ∈ D : Pa_G(j) ⊆ N_Θ(j) for all j }

denote the set of DAGs with skeleton contained in the graph defined by supp(Θ). By Theorem 2 and Assumption 1, we have G_0 ∈ D_{Θ_0}, so if supp(Θ̂) ⊇ supp(Θ_0), which occurs with high probability under the conditions of Section 5.1.1, it suffices to search over the reduced DAG space D_{Θ̂}.
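As an illustration of how the estimate Θ̂ translates into a reduced search space, the following Python sketch computes the candidate neighborhood sets N_{Θ̂}(j) from a thresholded precision matrix. It assumes scikit-learn is available; the GraphicalLasso estimator, the regularization level alpha, and the threshold tau are off-the-shelf choices of ours, not prescriptions from the paper.

# Sketch of the support-recovery step: graphical Lasso estimate of Theta, then thresholding.
import numpy as np
from sklearn.covariance import GraphicalLasso

def candidate_neighborhoods(X, alpha=0.05, tau=0.1):
    # Estimate Theta with the graphical Lasso, threshold it at tau, and return the
    # candidate neighborhood sets N_Theta(j) used to restrict the DAG search.
    model = GraphicalLasso(alpha=alpha).fit(X)
    Theta_hat = model.precision_
    p = Theta_hat.shape[0]
    support = (np.abs(Theta_hat) > tau) & ~np.eye(p, dtype=bool)
    return {j: set(np.flatnonzero(support[j])) for j in range(p)}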
Remark 17

In fact, we could reduce the search space even further to include only DAGs with moralized graph equal to the undirected graph defined by supp(Θ). The dynamic programming algorithm to be described in Section 6 only requires as input a superset of the skeleton; for alternative versions of the dynamic programming algorithm taking as input a superset of the moralized graph, we would indeed restrict D_Θ to DAGs with the correct moral structure.

We now consider an arbitrary d-sparse matrix Θ, with d ≤ n, and take G ∈ D_Θ. By Remark 5, we have

score_Ω(G) = ∑_{j=1}^p f_{σ_j}(Pa_G(j)),   (18)

where

f_{σ_j}(S) := (1/σ_j^2) · E[(x_j − b_j^T x_S)^2],

and b_j^T x_S is the best linear predictor for x_j regressed upon x_S. In order to estimate score_Ω(G), we use the empirical functions

f̂_{σ_j}(S) := (1/σ_j^2) · (1/n) ∑_{i=1}^n (x_{ij} − x_{i,S}^T b̂_j)^2 = (1/(σ_j^2 n)) ‖X_j − X_S b̂_j‖_2^2,   (19)

where

b̂_j := (X_S^T X_S)^{−1} X_S^T X_j

is the usual ordinary least squares solution for linear regression of X_j upon X_S. We will take S ⊆ N_Θ(j), so since |N_Θ(j)| ≤ d ≤ n, the matrix X_S^T X_S is invertible with high probability.
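The empirical score in equations (18)-(19) can be computed directly by per-node least squares; the sketch below is a straightforward implementation, where the argument names parent_sets and sigma2 are ours and the weights default to the unweighted case Ω = I.

# Empirical score (18)-(19): per-node OLS residual variances, weighted by 1/sigma_j^2.
import numpy as np

def empirical_score(X, parent_sets, sigma2=None):
    # score_hat_Omega(G) = sum_j (1/sigma_j^2) * (1/n) ||X_j - X_S b_hat_j||_2^2
    n, p = X.shape
    sigma2 = np.ones(p) if sigma2 is None else np.asarray(sigma2)
    total = 0.0
    for j in range(p):
        S = sorted(parent_sets[j])
        if S:
            b_hat = np.linalg.lstsq(X[:, S], X[:, j], rcond=None)[0]
            resid = X[:, j] - X[:, S] @ b_hat
        else:
            resid = X[:, j]
        total += np.mean(resid ** 2) / sigma2[j]
    return total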
The following lemma, proved in Appendix D.2, provides rates of convergence for the empirical score function:

Lemma 18

Suppose the x_i's are i.i.d. sub-Gaussian vectors with parameter σ^2, and suppose d ≤ n is a parameter such that |N_Θ(j)| ≤ d for all j. Then

|f̂_{σ_j}(S) − f_{σ_j}(S)| ≤ c_0 (σ^2/σ_j^2) √(log p / n),   for all j and S ⊆ N_Θ(j),   (20)

with probability at least 1 − c_1 exp(−c_2 log p).

In particular, we have the following result, proved in Appendix D.3, which provides a sufficient condition for the empirical score functions to succeed in selecting the true DAG. Here,

ξ_Ω(D_Θ) := min_{G ∈ D_Θ, G ⊉ G_0} { score_Ω(G) − score_Ω(G_0) }   (21)

is the gap between G_0 and the next best DAG in D_Θ. Note that the expression (21) is reminiscent of the expression (12) defined in Section 4.3, but we now allow Ω to be arbitrary.

Lemma 19
Suppose inequality (20) holds, and suppose

c_0 σ^2 √(log p / n) · ∑_{j=1}^p (1/σ_j^2) < ξ_Ω(D_Θ) / 2.   (22)

Then

\widehat{score}_Ω(G_0) < \widehat{score}_Ω(G),   for all G ∈ D_Θ such that G ⊉ G_0.   (23)

Remark 20
Lemma 19 does not explicitly assume that Ω is equal to Ω_0, the true matrix of error variances. However, inequality (22) can only be satisfied when ξ_Ω(D_Θ) > 0; hence, Ω should be chosen such that score_Ω(G_0) < score_Ω(G) for every G ∈ D_Θ with G ⊉ G_0. As discussed in Section 4.3, this condition holds for a wider range of choices of Ω.

Note that the conclusion (23) in Lemma 19 is not quite the same as the condition

G_0 = arg min_{G ∈ D_Θ} { \widehat{score}_Ω(G) },   (24)

which is what we would need for exact recovery by our score-minimizing algorithm. The issue is that score_Ω(G) is equal for all G ⊇ G_0; however, the empirical scores \widehat{score}_Ω(G) may differ among this class, so equation (24) may not be satisfied. However, it is easily seen from the proof of Lemma 19 that in fact,

arg min_{G ∈ D_Θ} { \widehat{score}_Ω(G) } ⊆ { G ∈ D_Θ : G ⊇ G_0 }.   (25)

By applying a thresholding procedure to the empirical score minimizer Ĝ ⊇ G_0 selected by our algorithm, we could then recover the true G_0. In other words, since Pa_{G_0}(j) ⊆ Pa_{Ĝ}(j) for each j, we could use standard sparse regression techniques to recover the parent set of each node in the true DAG.

To gain some intuition for the condition (22), consider the case when σ_j^2 = 1 for all j. Then the condition becomes

c_0 σ^2 √(log p / n) < ξ(D_Θ) / (2p).   (26)

If ξ(D_Θ) = Ω(1), which might be expected based on our calculations in Section 4.4, we require n ≥ C p^2 log p in order to guarantee statistical consistency, which is not a truly high-dimensional result. On the other hand, if ξ(D_Θ) = Ω(p), as is assumed in similar work on score-based DAG learning (van de Geer and Bühlmann, 2013; Bühlmann et al., 2013), our method is consistent provided (log p)/n → 0. In Section 5.2, we relax the condition (26) to a slightly weaker condition that is more likely to hold in settings of interest.
Motivated by our comments from the previous section, we establish a sufficient condition for statistical consistency that is slightly weaker than the condition (22), which still guarantees that equation (25) holds.

For two DAGs G, G′ ∈ D, define

H(G, G′) := { j : Pa_G(j) ≠ Pa_{G′}(j) }

to be the set of nodes on which the parent sets differ between the graphs G and G′, and define the ratio

γ_Ω(G, G′) := ( score_Ω(G) − score_Ω(G′) ) / |H(G, G′)|,

a rescaled version of the gap between the score functions. Consider the following condition:

Assumption 2
There exists ξ′ > 0 such that

γ_Ω(G_0) := min_{G ∈ D_Θ, G ⊉ G_0} { max_{G̃ ⊇ G_0} { γ_Ω(G, G̃) } } ≥ ξ′.   (27)

Note that in addition to minimizing over DAGs in the class D_Θ, the expression (27) defined in Assumption 2 takes an inner maximization over DAGs containing G_0. As established in Lemma 6, we have score_Ω(G̃) = score_Ω(G_0) whenever G_0 ⊆ G̃. However, |H(G, G̃)| may be appreciably different from |H(G, G_0)|, and we are only interested in computing the gap ratio between a DAG G ⊉ G_0 and the closest DAG containing G_0.

We then have the following result, proved in Appendix D.4:

Lemma 21
Under Assumption 2, suppose

|f̂_{σ_j}(S) − f_{σ_j}(S)| ≤ ξ′/2,   for all j and S ⊆ N_Θ(j).   (28)

Then the containment (25) holds.
Combining with Lemma 18, we have the following corollary:
Corollary 22
Suppose the x_i's are i.i.d. sub-Gaussian with parameter σ^2, and |N_Θ(j)| ≤ d for all j. Also suppose Assumption 2 holds. Then with probability at least 1 − c_1 exp(−c_2 log p), condition (25) is satisfied.

We now turn to the question of which values of ξ′ might be expected to give condition (27) for various DAGs. Certainly, we have

γ_Ω(G, G′) ≥ ( score_Ω(G) − score_Ω(G′) ) / p,

so the condition holds when p · ξ′ < ξ(D_Θ). However, for ξ′ = O(ξ(D_Θ)/p), Corollary 22 yields a scaling condition similar to inequality (26), which we wish to avoid. As motivated by our computations of the score functions for small DAGs (see Remark 14 in Section 4.4), the difference {score_Ω(G) − score_Ω(G_0)} seems to increase linearly with the number of edge reversals needed to transform G_0 to G. Hence, we might expect γ_Ω(G, G_0) to remain roughly constant, rather than decreasing linearly with p. The following lemma verifies this intuition in a special case. For a review of junction tree terminology, see Appendix A.1.

Lemma 23
Suppose the moralized graph M(G_0) admits a junction tree representation with only singleton separator sets. Let C_1, …, C_k denote the maximal cliques in M(G_0), and let {G_{0,ℓ}}_{ℓ=1}^k denote the corresponding restrictions of G_0 to the cliques. Then

γ_Ω(G_0) ≥ min_{1 ≤ ℓ ≤ k} γ_Ω(G_{0,ℓ}),

where

γ_Ω(G_{0,ℓ}) := min_{G_ℓ ∈ D_Θ|_{C_ℓ}, G_ℓ ⊉ G_{0,ℓ}} { max_{G̃_ℓ ⊇ G_{0,ℓ}} { ( score_Ω(G_ℓ) − score_Ω(G̃_ℓ) ) / |H(G_ℓ, G̃_ℓ)| } }

is the gap ratio computed over DAGs restricted to the clique C_ℓ that are consistent with the moralized graph.

The proof is contained in Appendix D.5. We might expect the gap ratio γ_Ω(G_{0,ℓ}) to be a function of the size of the clique. In particular, if the treewidth of M(G_0) is bounded by w and we have γ_Ω(G_{0,ℓ}) ≥ ξ_w for all ℓ, Lemma 23 implies that

γ_Ω(G_0) ≥ ξ_w,

and we only need the parameter ξ′ appearing in Assumption 2 to be of the order of ξ_w, rather than scaling as the inverse of p. We expect a version of Lemma 23 to hold for graphs with bounded treewidth even when the separator sets have larger cardinality, but a full generalization of Lemma 23 and a more accurate characterization of γ_Ω(G_0) for arbitrary graphs is beyond the scope of this paper.

We now describe how our algorithm for DAG structure estimation in linear SEMs may be extended easily to accommodate systematically corrupted data. This refers to the setting where we observe noisy surrogates {z_i}_{i=1}^n in place of {x_i}_{i=1}^n. Two common examples include the following:

(a) Additive noise.
We have z_i = x_i + w_i, where w_i ⊥⊥ x_i is additive noise with known covariance Σ_w.

(b) Missing data.
This is one instance of the more general setting of multiplicative noise. For each 1 ≤ j ≤ p, and independently over coordinates, we have

z_{ij} = x_{ij} with probability 1 − α, and z_{ij} = ⋆ with probability α,

where the missing-data probability α is either estimated or known.

We again divide our discussion into two parts: estimating Θ_0 and computing score functions based on corrupted data.

Following the observation of Loh and Wainwright (2013), the graphical Lasso (17) may still be used to estimate the inverse covariance matrix Θ_0 in the high-dimensional setting, where we plug in a suitable estimator Γ̂ for the covariance matrix Σ = cov(x_i), based on the corrupted observations {z_i}_{i=1}^n. For instance, in the additive noise scenario, we may take

Γ̂ = Z^T Z / n − Σ_w,   (29)

and in the missing-data setting, we may take

Γ̂ = (Z^T Z / n) ⊙ M,   (30)

where ⊙ denotes the Hadamard product and M is the matrix with diagonal entries equal to 1/(1 − α) and off-diagonal entries equal to 1/(1 − α)^2.

Assuming conditions such as sub-Gaussianity, the output Θ̂ of the modified graphical Lasso (17) is statistically consistent under similar scaling as in the uncorrupted setting (Loh and Wainwright, 2013). For instance, in the additive noise setting, where the z_i's are sub-Gaussian with parameter σ_z^2, Lemma 16 holds with σ^2 replaced by σ_z^2. Analogous results hold in the low-dimensional setting, when the expressions for Γ̂ in equations (29) and (30) are invertible with high probability, and we may simply use Θ̂ = (Γ̂)^{−1}.

We now describe how to estimate score functions for DAGs based on corrupted data. By equation (18), this reduces to estimating

f_{σ_j}(S) = (1/σ_j^2) · E[(x_j − b_j^T x_S)^2],

for a subset S ⊆ {1, …, p} ∖ {j}, with |S| ≤ n. Note that

σ_j^2 · f_{σ_j}(S) = Σ_{jj} − 2 b_j^T Σ_{S,j} + b_j^T Σ_{SS} b_j = Σ_{jj} − Σ_{j,S} Σ_{SS}^{−1} Σ_{S,j},

since b_j = Σ_{SS}^{−1} Σ_{S,j}.

Let Γ̂ be the estimator for Σ based on corrupted data used in the graphical Lasso (e.g., equations (29) and (30)). We then use the estimator

f̃_{σ_j}(S) = (1/σ_j^2) · ( Γ̂_{jj} − Γ̂_{j,S} Γ̂_{SS}^{−1} Γ̂_{S,j} ).   (31)

Note in particular that equation (31) reduces to expression (19) in the fully observed setting. We establish consistency of the estimator in equation (31) under the following deviation condition on Γ̂:

P( |||Γ̂_{SS} − Σ_{SS}|||_2 ≥ c_0 σ^2 ( √(d/n) + t ) ) ≤ c_1 exp(−c_2 n t^2),   for any |S| ≤ d.   (32)

For instance, such a condition holds in the case of the sub-Gaussian additive noise model (cf. Lemma 29 in Appendix E), with Γ̂ given by equation (29), where σ^2 = σ_z^2.
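The corrected covariance surrogates (29) and (30) are simple to form; the Python sketch below assumes the additive-noise covariance Σ_w is known and that missing entries of Z have been encoded as zeros, which is an implementation convention of ours rather than part of the paper's formulation.

# Plug-in surrogates for the covariance matrix under corrupted observations (eqs. (29)-(30)).
import numpy as np

def gamma_hat_additive(Z, Sigma_w):
    # Equation (29): surrogate when z_i = x_i + w_i with known cov(w_i) = Sigma_w.
    n = Z.shape[0]
    return Z.T @ Z / n - Sigma_w

def gamma_hat_missing(Z, alpha):
    # Equation (30): surrogate when entries are missing independently with probability alpha,
    # with missing values encoded as 0 in Z.
    n, p = Z.shape
    M = np.full((p, p), 1.0 / (1.0 - alpha) ** 2)
    np.fill_diagonal(M, 1.0 / (1.0 - alpha))
    return (Z.T @ Z / n) * M          # Hadamard product with M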
We have the following result, an extension of Lemma 18 that applies also to corrupted variables:

Lemma 24

Suppose Γ̂ satisfies the deviation condition (32), and suppose |N_Θ(j)| ≤ d for all j. Then

|f̃_{σ_j}(S) − f_{σ_j}(S)| ≤ c_0 (σ^2/σ_j^2) √(log p / n),   for all j and S ⊆ N_Θ(j),

with probability at least 1 − c_1 exp(−c_2 log p).

The proof is contained in Appendix D.6. In particular, Corollary 22, providing guarantees for statistical consistency, also holds.
6. Computational considerations
In practice, the main computational bottleneck in inferring the DAG structure comes from having to compute score functions over a large number of DAGs. The simplest approach of searching over all possible permutation orderings of the indices gives rise to p! candidate DAGs, which scales exponentially with p. In this section, we describe how the result of Theorem 2 provides a general framework for achieving vast computational savings in finding the best-scoring DAG when data are generated from a linear SEM. We begin by reviewing existing methods, and then describe how our results may be used in conjunction with dynamic programming to produce accurate and efficient DAG learning.

Following the literature, we call a score function over DAGs decomposable if it may be written as a sum of score functions over individual nodes, each of which is a function of only the node and its parents:

score(G) = ∑_{j=1}^p score_j(Pa_G(j)).

Note that we allow the score functions to differ across nodes. Consistent with our earlier notation, the goal is to find the DAG G ∈ D that minimizes score(G). Some common examples of decomposable scores that are used for DAG inference include maximum likelihood, BDe, BIC, and AIC (Chickering, 1995). By equation (18), the squared ℓ_2-score is clearly decomposable, and it gives an example where score_j differs over nodes. Another interesting example is the nonparametric maximum likelihood, which extends the ordinary likelihood method for scoring DAGs (Nowzohour and Bühlmann, 2013).

Various recent results have focused on methods for optimizing a decomposable score function over the space of candidate DAGs in an efficient manner. Some methods include exhaustive search (Silander and Myllymaki, 2006), greedy methods (Chickering, 2002), and dynamic programming (Ordyniak and Szeider, 2012; Korhonen and Parviainen, 2013). We will focus here on a dynamic programming method that takes as input an undirected graph and outputs the best-scoring DAG with skeleton contained in the input graph.

In this section, we detail a method due to Ordyniak and Szeider (2012) that will be useful for our purposes. Given an input undirected graph G_I and a decomposable score function, the dynamic programming algorithm finds a DAG with minimal score that has skeleton contained in G_I. Let {N_I(j)}_{j=1}^p denote the neighborhood sets of G_I. The runtime of the dynamic programming algorithm is exponential in the treewidth w of G_I; hence, the algorithm is only tractable for bounded-treewidth graphs.

The main steps of the dynamic programming algorithm are as follows. For a review of basic terminology from graph theory, including treewidth and tree decompositions, see Appendix A; for further details and a proof of correctness, see Ordyniak and Szeider (2012).

1. Construct a tree decomposition of G_I with minimal treewidth.
2. Construct a nice tree decomposition of the graph. Let χ(t) denote the subset of {1, …, p} associated to a node t in the nice tree decomposition.

3. Starting from the leaves of the nice tree decomposition up to the root, compute the record for each node t. The record R(t) is the set of tuples (a, p, s) corresponding to minimal-scoring DAGs defined on the vertices χ*(t) in the subtree attached to t, with skeleton contained in G_I. For each such DAG, s is the score; a lists the parent sets of vertices in χ(t), such that a(v) ⊆ N_I(v) for each v ∈ χ(t), and a(v) restricted to χ*(t) agrees with the partial DAG; and p lists the directed paths between vertices in χ(t). The records R(t) may be computed recursively over the nice tree decomposition as follows:

• Join node:
Suppose t has children t_1 and t_2. Then R(t) is the set of tuples (a, p, s) formed by tuples (a_1, p_1, s_1) ∈ R(t_1) and (a_2, p_2, s_2) ∈ R(t_2), where (1) a = a_1 = a_2; (2) p is the transitive closure of p_1 ∪ p_2; (3) p contains no cycles; and (4) s = s_1 + s_2.

• Introduce node:
Suppose t is an introduce node with child t′, such that χ(t) = χ(t′) ∪ {v_0}. Then R(t) is the set of tuples (a, p, s) formed by pairs P ⊆ N_I(v_0) and (a′, p′, s′) ∈ R(t′), such that (1) a(v_0) = P; (2) for every v ∈ χ(t′), we have a(v) = a′(v); (3) p is the transitive closure of p′ ∪ {(u, v_0) : u ∈ P} ∪ {(v_0, u) : v_0 ∈ a′(u), u ∈ χ(t′)}; (4) p contains no cycles; and (5) s = s′.

• Forget node:
Suppose t is a forget node with child t′, such that χ(t′) = χ(t) ∪ {v_0}. Then R(t) is the set of tuples (a, p, s) formed from tuples (a′, p′, s′) ∈ R(t′), such that (1) a(u) = a′(u) for all u ∈ χ(t); (2) p = {(u, v) ∈ p′ : u, v ∈ χ(t)}; and (3) s = s′ + score_{v_0}(a′(v_0)).

Note that Korhonen and Parviainen (2013) present a variant of this dynamic programming method, also using a nice tree decomposition, which is applicable even for graphs with unbounded degree but bounded treewidth. They assume that the starting undirected graph G_I is a superset of the moralized DAG. Their algorithm runs in time linear in p and exponential in w. From a theoretical perspective, we are agnostic to both methods, since our results on statistical consistency of the graphical Lasso require the true moralized graph to have bounded degree. However, since supp(Θ_0) exactly corresponds to the edge set of M(G_0), the alternative method will also lead to correct recovery. In practice, the relative efficiency of the two dynamic programming algorithms will rely heavily on the structure of M(G_0).

We now review the runtime of the various components of the dynamic programming algorithm described in Section 6.2. This is mentioned briefly in Korhonen and Parviainen (2013), but we include details here for completeness. In our calculations, we assume the treewidth w of G_I is bounded and treat w as a constant.

The first step involves constructing a tree decomposition of minimal treewidth w, which may be done in time O(p). The second step involves constructing a nice tree decomposition. Given a tree decomposition of width w, a nice tree decomposition with O(p) nodes and treewidth w may be constructed in O(p) time (see Appendix A.2). Finally, the third step involves computing records for nodes in the nice tree decomposition. We consider the three different types of nodes in succession. Note that

|R(t)| ≤ 2^{(w+1)(w+d)},   for all t,   (33)

where d = max_j |N_I(j)|. This is because the number of choices of parent sets of any vertex in χ(t) is bounded by 2^d, leading to a factor of 2^{d(w+1)}, and the number of possible pairs that are connected by a path is bounded by 2^{(w+1)w}.

• If t is a join node with children t_1 and t_2, we may compute R(t) by comparing pairs of records in R(t_1) and R(t_2); by inequality (33), this may be done in time O(2^{2(w+1)(w+d)}).

• If t is an introduce node with child t′, we may compute R(t) by considering records in R(t′) and parent sets of the introduced node v_0. Since the number of choices for the latter is bounded by 2^d, we conclude that R(t) may be computed in time O(2^{(w+1)(w+d)+d}).

• Clearly, if t is a forget node, then R(t) may be computed in time O(2^{(w+1)(w+d)}).

Altogether, we conclude that all records of nodes in the nice tree decomposition may be computed in time O(p · 2^{2(w+1)(w+d)}). Combined with the graphical Lasso preprocessing step for estimating M(G_0), this leads to an overall complexity that remains polynomial in p. This may be compared to the runtime of other standard methods for causal inference, including the PC algorithm (Spirtes et al., 2000), which has computational complexity O(p^w), and (direct) LiNGAM (Shimizu et al., 2006, 2011), which requires time O(p^4). It has been noted that both the PC and LiNGAM algorithms may be expedited when prior knowledge about the DAG space is available, further highlighting the power of Theorem 2 as a preprocessing step for any causal inference algorithm.
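As a rough illustration of the preprocessing required in step 1, the following Python sketch uses networkx (assumed to be available) to obtain a tree decomposition of an estimated moralized graph. Note that the routine is a min-fill-in heuristic returning an upper bound on the treewidth, not the minimal-treewidth decomposition assumed in the runtime analysis above.

# Tree decomposition of the (estimated) moralized graph via a networkx heuristic.
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_fill_in

moral_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]     # e.g., edges read off supp(Theta_hat)
G_I = nx.Graph(moral_edges)

width, decomposition = treewidth_min_fill_in(G_I)
print("treewidth upper bound:", width)
for bag in decomposition.nodes:                     # bags are frozensets of vertices
    print(sorted(bag))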
7. Discussion
We have provided a new framework for estimating the DAG corresponding to a linear SEM. We have shown that the inverse covariance matrix of linear SEMs always reflects the edge structure of the moralized graph, even in non-Gaussian settings, and the reverse statement also holds under a mild faithfulness assumption. Furthermore, we have shown that when the error variances are known up to close precision, a simple weighted squared ℓ_2-loss may be used to select the correct DAG. As a corollary, we have established identifiability for the class of linear SEMs with error variances specified up to a constant multiple. We have proved that our methods are statistically consistent, under reasonable assumptions on the gap between the score of the true DAG and the next best DAG in the model class. A characterization of this gap parameter for various graphical structures is the topic of future work.

We have also shown how dynamic programming may be used to select the best-scoring DAG in an efficient manner, assuming the treewidth of the moralized graph is small. Our results relating the inverse covariance matrix to the moralized DAG provide a powerful method for reducing the DAG search space as a preprocessing step for dynamic programming, and are the first to provide rigorous guarantees for when the graphical Lasso may be used in non-Gaussian settings. Note that the dynamic programming algorithm only uses the information that the true DAG has skeleton lying in the input graph, and does not incorporate any information about (a) the fact that the data come from a linear SEM; or (b) the fact that the input graph exactly equals the moralized DAG. Intuitively, both types of information should place significant constraints on the restricted DAG space, leading to further speedups in the dynamic programming algorithm. Perhaps these restrictions would make it possible to establish a version of dynamic programming for DAGs where the moralized graph has bounded degree but large treewidth.

An important open question concerns scoring candidate DAGs when the diagonal matrix Ω of error variances is unknown. As we have seen, using the weighted squared ℓ_2-loss to score DAGs may produce a graph that is far from the true DAG when Ω is misspecified. Alternatively, it would be useful to have a checkable condition that would allow us to verify whether a given matrix Ω will correctly select the true DAG, or to be able to select the true Ω from among a finite collection of candidate matrices.

Acknowledgments
We acknowledge all the members of the Seminar für Statistik for providing an immensely hospitable and fruitful environment when PL was visiting ETH, and the Forschungsinstitut für Mathematik at ETH Zürich for financial support. We also thank Varun Jog for helpful discussions. PL was additionally supported by a Hertz Fellowship and an NSF Graduate Research Fellowship.
Appendix A. Graph-theoretic concepts
In this Appendix, we review some fundamental concepts in graph theory that we use in our exposition. We begin by discussing junction trees, and then move to the related notion of tree decompositions. Note that these are purely graph-theoretic operations that may be performed on an arbitrary undirected graph.
A.1 Junction trees
We begin with the basic junction tree framework. For more details, see Lauritzen (1996) or Koller and Friedman (2009).

For an undirected graph G = (V, E), a triangulation is an augmented graph G̃ = (V, Ẽ) that contains no chordless cycles of length greater than three. By classical graph theory, any triangulation G̃ gives rise to a junction tree representation of G, where nodes in the junction tree are subsets of V corresponding to maximal cliques of G̃, and the intersection of any two adjacent cliques C_1 and C_2 in the junction tree is referred to as a separator set S = C_1 ∩ C_2. Furthermore, any junction tree must satisfy the running intersection property, meaning that for any two nodes in the junction tree, say corresponding to cliques C_j and C_k, the intersection C_j ∩ C_k must belong to every separator set on the unique path between C_j and C_k in the junction tree. The treewidth of G is defined to be one less than the size of the largest clique in any triangulation G̃ of G, minimized over all triangulations.

As a classic example, note that if G is a tree, then G is already triangulated, and the junction tree parallels the tree structure of G. The maximal cliques in the junction tree are equal to the edges of G, and the separator sets correspond to singleton vertices. The treewidth of G is consequently equal to 1.

A.2 Tree decompositions
We now highlight some basic concepts of tree decompositions and nice tree decompositions used in our dynamic programming framework. Our exposition follows Kloks (1994).

Let G = (V, E) be an undirected graph. A tree decomposition of G is a tree T with node set W such that each node t ∈ W is associated with a subset V_t ⊆ V, and the following properties are satisfied:
(a) ⋃_{t ∈ T} V_t = V;
(b) for all (u, v) ∈ E, there exists a node t ∈ W such that u, v ∈ V_t;
(c) for each v ∈ V, the set of nodes {t : v ∈ V_t} forms a subtree of T.
The width of the tree decomposition is max_{t ∈ T} |V_t| − 1. The treewidth of G is the minimal width of any tree decomposition of G; this quantity agrees with the treewidth defined in terms of junction trees in the previous section. If G has bounded treewidth, a tree decomposition with minimum width may be constructed in time O(|V|) (cf. Chapter 15 of Kloks (1994)).

A nice tree decomposition is a rooted tree decomposition satisfying the following properties:
(a) every node has at most two children;
(b) if a node t has two children r and s, then V_t = V_r = V_s;
(c) if a node t has one child s, then either
(i) |V_t| = |V_s| + 1 and V_s ⊆ V_t, or
(ii) |V_s| = |V_t| + 1 and V_t ⊆ V_s.
Nodes of the form (b), (c)(i), and (c)(ii) are called join nodes, introduce nodes, and forget nodes, respectively. Given a tree decomposition of G with width w, a nice tree decomposition with width w and at most 4|V| nodes may be computed in time O(|V|) (cf. Lemma 13.1.3 of Kloks (1994)).
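As a concrete illustration of these definitions, the following sketch (assuming the networkx library is available) builds a tree decomposition of a small graph with the min-degree heuristic and checks properties (a)–(c); the heuristic returns a valid decomposition whose width upper-bounds the treewidth, so the reported width need not be minimal.

```python
# A small sketch that builds a tree decomposition with the min-degree heuristic
# and checks properties (a)-(c) from Appendix A.2.
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

G = nx.cycle_graph(6)                  # a 6-cycle has treewidth 2
width, T = treewidth_min_degree(G)     # T's nodes are bags (frozensets of vertices)
print("width of this decomposition:", width)

# (a) the bags cover every vertex
assert set().union(*T.nodes) == set(G.nodes)
# (b) every edge of G is contained in some bag
assert all(any({u, v} <= bag for bag in T.nodes) for u, v in G.edges)
# (c) for each vertex, the bags containing it induce a connected subtree of T
for v in G.nodes:
    bags_with_v = [bag for bag in T.nodes if v in bag]
    assert nx.is_connected(T.subgraph(bags_with_v))
```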
Appendix B. Matrix derivations

In this section, we present a few matrix results that are used to prove Theorem 7. Define a unit lower triangular (LT) matrix to be a lower triangular matrix with 1's on the diagonal. Recall that matrices A and B are permutation similar if there exists a permutation matrix P such that A = PBP^T. Call a matrix permutation unit LT if it is permutation similar to a unit lower triangular matrix. We have the following lemma:

Lemma 25
Suppose A and B are permutation unit LT matrices, and suppose AA^T = BB^T. Then A = B.

Proof
Under the appropriate relabeling, we assume without loss of generality that A is unit LT. There exists a permutation matrix P such that C := PBP^T is also unit LT. We have

P AA^T P^T = CC^T. (34)

Let π be the permutation on {1, . . . , n} such that P_{i,π(i)} = 1 for all i, and P has 0's everywhere else. Define the notation ã_{ij} := (PA)_{ij} and m_{ij} := (CC^T)_{ij}, and let {c_{ij}} denote the entries of C. We will make use of the following equalities, which follow from equation (34) and the fact that C is unit LT:

∑_k ã_{ik} ã_{jk} = m_{ij} = ∑_k c_{ik} c_{jk}, ∀ i, j. (35)
The proof proceeds by an outer induction on the row index i; the base case i = 1 follows from equation (35) with i = j = 1, together with the fact that ã_{1,π(1)} = 1. For the induction step, suppose that ã_{ℓ,π(k)} = c_{ℓk} for all rows ℓ ≤ i − 1. We first show that ã_{i,π(j)} = c_{ij} for all j < i by a sub-induction on j. For j = 1, we have by equation (35) and the base result for i = 1 that ã_{i,π(1)} = m_{i,1} = c_{i,1}, which is exactly what we want. For the sub-induction step, consider 1 < j < i, and suppose ã_{i,π(ℓ)} = c_{iℓ} for all ℓ < j. Note that ã_{j,π(ℓ)} = 0 for all ℓ > j by the outer induction hypothesis. Hence, equation (35) and the fact that ã_{j,π(j)} = 1 give

∑_{ℓ ≤ j} ã_{i,π(ℓ)} c_{jℓ} = m_{ij} = ∑_{ℓ ≤ j} c_{iℓ} c_{jℓ},

so that ã_{i,π(j)} = c_{ij}, completing the sub-induction. Finally, applying equation (35) with j = i and using the fact that ã_{i,π(i)} = 1 shows that ã_{i,π(ℓ)} = 0 = c_{iℓ} for ℓ > i, completing the outer induction. Hence PA = CP, so that A = P^T C P = B.
Lemma 26

Suppose A ∈ ℝ^{n×n} is positive definite with det(A) = 1. Then

min{tr(AB) : B ≻ 0 and det(B) = 1} = n.

Proof
Consider the singular value decomposition A = UΛU^*, and note that tr(AB) = tr(Λ(U^* BU)). Denote b_{ij} := (U^* BU)_{ij} and λ_i := Λ_{ii}. Then by the AM-GM inequality and Hadamard's inequality, we have

(1/n) · tr(Λ(U^* BU)) ≥ (∏_i λ_i b_{ii})^{1/n} = (det(A))^{1/n} · (∏_i b_{ii})^{1/n} ≥ (det(U^* BU))^{1/n} = 1,

implying the result.

Building upon Lemmas 25 and 26, we obtain the following result:

Lemma 27
Suppose A and B are n × n permutation unit LT matrices. Then

min_B tr(AA^T B^T B) ≥ n, (39)

with equality achieved if and only if B = A^{-1}.

Proof
Write A′ = AA^T and B′ = B^T B, and note that since det(A) = det(B) = 1, we also have det(A′) = det(B′) = 1. Then inequality (39) holds by Lemma 26.

To recover the conditions for equality, note that equality holds in Hadamard's inequality if and only if some b_{ii} = 0 or the matrix U^* BU is diagonal. Note that the first case is not possible, since U^* BU ≻ 0. In the second case, we see that in addition, we need b_{ii} = λ_i^{-1} for all i in order to achieve equality in the AM-GM inequality. It follows that U^* BU = Λ^{-1}, so AA^T = B^{-1} B^{-T}. Since A and B are permutation unit LT, Lemma 25 implies that the last equality can only hold when B = A^{-1}.
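The following numerical sketch (ours, purely illustrative) checks Lemma 27 on a randomly generated permutation unit LT matrix: the objective equals n at B = A^{-1} and is no smaller for another admissible choice of B.

```python
# Numerical sanity check (a sketch, not part of the proof) for Lemma 27.
import numpy as np

rng = np.random.default_rng(0)
n = 5

L = np.tril(rng.standard_normal((n, n)), k=-1) + np.eye(n)  # unit lower triangular
P = np.eye(n)[rng.permutation(n)]                            # random permutation matrix
A = P @ L @ P.T                                              # permutation unit LT

def objective(B):
    return np.trace(A @ A.T @ B.T @ B)

B_star = np.linalg.inv(A)
print(objective(B_star))                    # equals n (up to rounding)

# any other permutation unit LT matrix gives a value no smaller than n
L2 = np.tril(rng.standard_normal((n, n)), k=-1) + np.eye(n)
B_other = P @ np.linalg.inv(L2) @ P.T       # inverse of a unit LT matrix is unit LT
print(objective(B_other) >= n - 1e-9)       # True
```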
Appendix C. Proofs for population-level results

In this section, we provide proofs for the remaining results in Sections 3 and 4.

C.1 Proof of Lemma 1
We first show that Ω is a diagonal matrix. Consider j < k; we will show that ǫ_j ⊥⊥ ǫ_k, from which we conclude that

E[ǫ_j ǫ_k] = E[ǫ_j] · E[ǫ_k] = 0.

Indeed, we have ǫ_k ⊥⊥ (X_1, . . . , X_{k−1}) by assumption. Since ǫ_j = X_j − b_j^T X is a deterministic function of (X_1, . . . , X_j), it follows that ǫ_k ⊥⊥ ǫ_j, as claimed.

Turning to equations (4) and (5), note that since B ∈ U, the matrix (I − B) is always invertible, and by equation (3), we have

Σ = (I − B)^{-T} Ω (I − B)^{-1}, (40)

and

Θ = Σ^{-1} = (I − B) Ω^{-1} (I − B)^T. (41)

Then expanding and using the fact that B is upper triangular and Ω is diagonal, we obtain equations (4) and (5).
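For readers who want to verify equations (40)–(41) numerically, here is a short sketch on a hypothetical five-node SEM of our own choosing; it also checks that the nonzero pattern of Θ is contained in the edge set of the moralized graph.

```python
# A sketch (hypothetical 5-node example) checking equation (41) and the fact that
# the support of Theta lies within the edge set of the moralized graph.
import numpy as np
from itertools import combinations

p = 5
B = np.zeros((p, p))                # autoregression matrix, upper triangular (topological order)
B[0, 2] = 0.8; B[1, 2] = -0.5; B[2, 3] = 1.2; B[1, 4] = 0.7   # edges 0->2, 1->2, 2->3, 1->4
Omega = np.diag([1.0, 0.5, 2.0, 1.5, 0.8])                    # error variances

I = np.eye(p)
Sigma = np.linalg.inv(I - B).T @ Omega @ np.linalg.inv(I - B)  # equation (40)
Theta = (I - B) @ np.linalg.inv(Omega) @ (I - B).T             # equation (41)
assert np.allclose(Theta, np.linalg.inv(Sigma))

# moralized graph: connect each parent-child pair and each pair of parents sharing a child
moral = set()
for j, k in zip(*np.nonzero(B)):
    moral.add(frozenset((j, k)))
for child in range(p):
    for j, k in combinations(np.nonzero(B[:, child])[0], 2):
        moral.add(frozenset((j, k)))

for j, k in combinations(range(p), 2):
    if abs(Theta[j, k]) > 1e-10:
        assert frozenset((j, k)) in moral   # supp(Theta) lies within the moralized graph
```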
C.2 Proof of Lemma 6

Since G_0 ⊆ G, we have Pa_{G_0}(j) ⊆ Pa_G(j) for each j. Furthermore, no element of Pa_G(j) may be a descendant of j in G_0, since this would contradict the fact that G contains no cycles. By the Markov property of G_0, we therefore have

X_j ⊥⊥ X_{Pa_G(j) \ Pa_{G_0}(j)} | X_{Pa_{G_0}(j)}.

Thus, the linear regression coefficients for X_j regressed upon X_{Pa_G(j)} are simply the linear regression coefficients for X_j regressed upon X_{Pa_{G_0}(j)} (and the remaining coefficients, for X_{Pa_G(j) \ Pa_{G_0}(j)}, are zero). By Remark 5, we conclude that B_0 = B_G = arg min_{B ∈ U_G} {score_Ω(B)}, and the uniqueness of B_0 follows from the uniqueness of B_G.
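The population-level claim in this proof can be illustrated by simulation; the sketch below (with an assumed four-node SEM and parameter values of our own choosing) regresses a variable on a strict superset of its true parent set and observes that the extra coefficients are approximately zero.

```python
# Illustrative simulation: regressing X_j on a superset of its true parents
# recovers the true coefficients, with (approximately) zero extra coefficients.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200000, 4
B = np.zeros((p, p))
B[0, 1] = 1.5; B[1, 2] = -0.8        # true DAG: 0 -> 1 -> 2, node 3 isolated
eps = rng.standard_normal((n, p)) * np.array([1.0, 0.7, 1.3, 0.9])

# generate data in topological order: X_j = sum_k B[k, j] X_k + eps_j
X = np.zeros((n, p))
for j in range(p):
    X[:, j] = X @ B[:, j] + eps[:, j]

# regress X_2 on the superset {0, 1, 3} of its true parent set {1}
S = [0, 1, 3]
coef = np.linalg.lstsq(X[:, S], X[:, 2], rcond=None)[0]
print(dict(zip(S, np.round(coef, 3))))   # coefficient on 1 is ~ -0.8; on 0 and 3 ~ 0
```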
C.3 Proof of Theorem 9

From the decomposition (6), it is easy to see that for any B ∈ U, we have

a_min ≤ score_Ω(B) / score_{Ω_0}(B) ≤ a_max, (42)

simply by comparing individual terms; e.g.,

(1/(Ω)_{jj}) · E[(X_j − b_j^T X)^2] ≤ max_j {(Ω_0)_{jj} / (Ω)_{jj}} · (1/(Ω_0)_{jj}) · E[(X_j − b_j^T X)^2] = a_max · ((1/(Ω_0)_{jj}) · E[(X_j − b_j^T X)^2]).

Note that if G ⊇ G_0, then by Lemma 6, the matrix B_0 is the unique minimizer of score_Ω(B) among the class U_G. Now consider G ⊉ G_0 and B ∈ U_G. We have

(1 + ξ/p) · score_{Ω_0}(B_0) = min_{G′ ∈ D, G′ ⊉ G_0} {score_{Ω_0}(G′)} ≤ score_{Ω_0}(G) ≤ score_{Ω_0}(B), (43)

where the first relation uses the definition of the gap (10) and the fact that score_{Ω_0}(B_0) = p by Theorem 7. Hence,

score_Ω(B_0) ≤ a_max · score_{Ω_0}(B_0) ≤ (a_max / (1 + ξ/p)) · score_{Ω_0}(B) ≤ (a_max / (a_min (1 + ξ/p))) · score_Ω(B),

where the first and third inequalities use inequality (42), and the second inequality uses inequality (43). By the assumption (11), it follows that

score_Ω(B_0) ≤ score_Ω(B),

as wanted. The statement regarding strict inequality is clear.

C.4 Proof of Lemma 11
We first consider the case when r ≥ 1. Then a_max/a_min = r, so combining Theorem 9 with the expression (14), we obtain a sufficient condition relating r and b. Rearranging gives a quadratic inequality in b², and solving this quadratic for b² and simplifying yields the desired expression.

If instead r ≤ 1, we have a_max/a_min = 1/r, so the sufficient condition involves 1/r instead; rearranging again gives a quadratic inequality in b², now with coefficients involving 1 − r rather than r − 1, and solving for b² and simplifying further yields the stated expression.

Figure 3: Alternative DAGs.
C.5 Proof of Lemma 13
To compute the gap ξ, it is sufficient to consider the graphs in Figure 3. Indeed, we have score_{Ω_0}(G) ≤ score_{Ω_0}(G′) whenever G′ ⊆ G, so we only need to consider maximal elements in the poset of DAGs not containing G_0. Consider the DAGs in panels (a) and (c) of Figure 3, with autoregression matrices C and E whose free entries are denoted c_1, c_2, c_3 and e_1, e_2, e_3, respectively. A simple calculation expresses score_{Ω_0}(C) as a quadratic function of (c_1, c_2, c_3); minimizing over these entries leads to

ξ_1 = min_{c_1, c_2, c_3} {score_{Ω_0}(C) − score_{Ω_0}(B_0)}. (44)

Similarly, we may compute score_{Ω_0}(E) and minimize over (e_1, e_2, e_3), leading to

ξ_2 = min_{e_1, e_2, e_3} {score_{Ω_0}(E) − score_{Ω_0}(B_0)}. (45)

Finally, note that the graphs in panels (b) and (d) of Figure 3 are mirror images of the graphs in panels (a) and (c), respectively. Hence, we obtain

ξ_3 = min_{d_1, d_2, d_3} {score_{Ω_0}(D) − score_{Ω_0}(B_0)} and ξ_4 = min_{f_1, f_2, f_3} {score_{Ω_0}(F) − score_{Ω_0}(B_0)},

simply by swapping the roles of nodes 1 and 2 in the expressions above. Taking ξ = min{ξ_1, ξ_2, ξ_3, ξ_4} then yields the desired result.

Appendix D. Proofs for statistical consistency
In this Appendix, we provide the proofs for the lemmas on statistical consistency stated inSection 5.
D.1 Proof of Lemma 15
This result follows from the fact that

‖Θ̂ − Θ‖_max ≤ |||Θ̂ − Θ|||,

together with results on the spectral norm of sub-Gaussian covariances and their inverses (see Lemma 29 in Appendix E).
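The elementwise/spectral-norm comparison used here is easy to check numerically; the following sketch (illustrative only) does so for a random symmetric matrix.

```python
# Quick numerical check: the maximum absolute entry of a matrix is bounded by
# its spectral norm, since |M_ij| = |e_i^T M e_j| <= |||M|||.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
M = A + A.T                            # symmetric "difference" matrix
max_norm = np.max(np.abs(M))
spec_norm = np.linalg.norm(M, 2)       # largest singular value
assert max_norm <= spec_norm + 1e-12
print(max_norm, spec_norm)
```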
D.2 Proof of Lemma 18

First consider a fixed pair (j, S) such that S ⊆ N_Θ(j). We may write

x_j = b_j^T x_S + e_j, (46)

where e_j has zero mean and is uncorrelated with x_S (and also depends on the choice of S). In matrix notation, we have

b̂_j = (X_S^T X_S)^{-1}(X_S^T X_j) = (X_S^T X_S)^{-1} X_S^T (X_S b_j + E_j) = b_j + (X_S^T X_S)^{-1} X_S^T E_j,

where the second equality follows from equation (46). Hence,

σ_j² · f̂_{σ_j}(S) = (1/n) ‖X_j − X_S b̂_j‖_2² = (1/n) ‖X_S(b_j − b̂_j) + E_j‖_2² = (1/n) ‖(I − X_S(X_S^T X_S)^{-1} X_S^T) E_j‖_2². (47)

By the triangle inequality, we have

| ‖(I − X_S(X_S^T X_S)^{-1} X_S^T) E_j‖_2 − ‖E_j‖_2 | ≤ ‖X_S(X_S^T X_S)^{-1} X_S^T E_j‖_2 ≤ |||(X_S^T X_S)^{-1} X_S^T||| · ‖E_j‖_2. (48)

Furthermore,

|||(X_S^T X_S)^{-1} X_S^T||| = |||X_S(X_S^T X_S)^{-1}||| = sup_{‖v‖_2 ≤ 1} {v^T (X_S^T X_S)^{-1} X_S^T X_S (X_S^T X_S)^{-1} v}^{1/2} = sup_{‖v‖_2 ≤ 1} {v^T (X_S^T X_S)^{-1} v}^{1/2} = (1/√n) · |||(X_S^T X_S / n)^{-1}|||^{1/2} ≤ (C/√n) · (|||Σ_SS^{-1}||| + 2σ² |||Σ_SS^{-1}|||² · max{δ, δ²})^{1/2},

with probability at least 1 − 2 exp(−c n t²), where δ = c′ √(|S|/n) + c″ t, by Lemma 29. Taking a union bound over all 2^d choices for S and p choices for j, and setting t = c √((d + log p)/n), we have

|||(X_S^T X_S)^{-1} X_S^T||| ≤ C′/√n, for all S such that S ⊆ N_Θ(j) for some j, (49)

with probability at least

1 − c₁ exp(−c₂ n t² + d log 2 + log p) ≥ 1 − c₁ exp(−c′ (d + log p)).

Combining inequalities (48) and (49), we have the uniform bound

(1 − C′/√n) ‖E_j‖_2 ≤ ‖(I − X_S(X_S^T X_S)^{-1} X_S^T) E_j‖_2 ≤ (1 + C′/√n) ‖E_j‖_2, w.h.p.,

which together with equation (47) implies that

| σ_j² · f̂_{σ_j}(S) − (1/n) ‖E_j‖_2² | ≤ (3C′/√n) · (1/n) ‖E_j‖_2², (50)

using the fact that

max{1 − (1 − a)², (1 + a)² − 1} = max{2a − a², 2a + a²} = 2a + a² ≤ 3a,

for a = C′/√n sufficiently small. Furthermore,

(1/n) E[‖E_j‖_2²] = σ_j² · f_{σ_j}(S).

Note that the e_j's are i.i.d. sub-Gaussians with parameter at most cσ², since we may write e_j = b̃_j^T x for the appropriate b̃_j ∈ ℝ^p, and ‖b̃_j‖_2 is bounded in terms of the eigenvalues of Σ. Applying the usual sub-Gaussian tail bounds, we then have

P( |(1/n) ‖E_j‖_2² − (1/n) E[‖E_j‖_2²]| ≥ cσ² t ) ≤ c₁ exp(−c₂ n t²), ∀ j,

and taking a union bound over j and setting t = c′ √(log p / n) gives

max_j |(1/n) ‖E_j‖_2² − (1/n) E[‖E_j‖_2²]| ≤ c₀ σ² √(log p / n), (51)

with probability at least 1 − c₁ exp(−c₂ log p).
Combining inequalities (50) and (51), it follows that

σ_j² |f̂_{σ_j}(S) − f_{σ_j}(S)| ≤ |σ_j² · f̂_{σ_j}(S) − (1/n) ‖E_j‖_2²| + |(1/n) ‖E_j‖_2² − (1/n) E[‖E_j‖_2²]| ≤ (3C′/√n) · ((1/n) E[‖E_j‖_2²] + c₀ σ² √(log p / n)) + c₀ σ² √(log p / n) ≤ c′ σ² √(log p / n),

with probability at least 1 − c₁ exp(−c₂ log p).
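The algebraic identity underlying equation (47) — that the fitted residual equals the projection of E_j onto the orthogonal complement of the column space of X_S — can be checked directly; the sketch below uses synthetic data and parameter values of our own choosing.

```python
# Sketch verifying the identity behind equation (47) on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n, s = 200, 3
X_S = rng.standard_normal((n, s))
b_j = np.array([1.0, -2.0, 0.5])
E_j = rng.standard_normal(n)
X_j = X_S @ b_j + E_j

b_hat = np.linalg.solve(X_S.T @ X_S, X_S.T @ X_j)     # OLS coefficients
resid = X_j - X_S @ b_hat

P_S = X_S @ np.linalg.solve(X_S.T @ X_S, X_S.T)        # projection onto col(X_S)
assert np.allclose(resid, (np.eye(n) - P_S) @ E_j)
print(np.mean(resid**2))                               # (1/n)||(I - P_S) E_j||^2
```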
D.3 Proof of Lemma 19

Combining inequalities (20) and (22) and using the triangle inequality, we have

|ŝcore_Ω(G) − score_Ω(G)| ≤ ∑_{j=1}^p |ŝcore_{σ_j}(Pa_G(j)) − score_{σ_j}(Pa_G(j))| < ξ(D_Θ)/2, (52)

for all G ∈ D_Θ. In particular, for G ∈ D_Θ such that G ⊉ G_0, we have

ŝcore_Ω(G_0) < score_Ω(G_0) + ξ(D_Θ)/2 ≤ (score_Ω(G) − ξ(D_Θ)) + ξ(D_Θ)/2 < ŝcore_Ω(G),

where the first and third inequalities use inequality (52), and the second inequality uses the definition of the gap ξ(D_Θ). This implies inequality (23).

D.4 Proof of Lemma 21
Consider G ∈ D_Θ with G ⊉ G_0, and consider G̃ ⊇ G_0 such that γ_Ω(G, G̃) is maximized. Note that if Pa_G(j) = Pa_{G̃}(j), then certainly

f̂_{σ_j}(Pa_G(j)) − f̂_{σ_j}(Pa_{G̃}(j)) = 0 = f_{σ_j}(Pa_G(j)) − f_{σ_j}(Pa_{G̃}(j)).

Hence,

|(ŝcore_Ω(G) − ŝcore_Ω(G̃)) − (score_Ω(G) − score_Ω(G̃))| ≤ |H(G, G̃)| · ξ′, (53)

using inequality (28) and the triangle inequality. Furthermore, by inequality (27),

|H(G, G̃)| · ξ′ ≤ [(score_Ω(G) − score_Ω(G̃)) / (2ξ′)] · ξ′ = (score_Ω(G) − score_Ω(G̃))/2. (54)

Combining inequalities (53) and (54) gives

ŝcore_Ω(G) − ŝcore_Ω(G̃) ≥ (score_Ω(G) − score_Ω(G̃))/2 = (score_Ω(G) − score_Ω(G_0))/2 > 0,

where the last inequality holds because of the assumption ξ′ > 0. Hence, G ∉ arg min_{G ∈ D_Θ} {ŝcore_Ω(G)}, implying the desired result.

D.5 Proof of Lemma 23
We begin with a simple lemma:
Lemma 28
Suppose M(G) admits a junction tree representation with only singleton separators, and let C_1, . . . , C_k denote the maximal cliques. If X follows a linear SEM over G, then the marginal distribution of X over the nodes in any clique C_ℓ also follows a linear SEM over C_ℓ, with DAG structure specified by G_ℓ, the restriction of G to C_ℓ. In addition, the autoregression matrix for the marginal SEM is simply the autoregression matrix for the full SEM restricted to the nodes in C_ℓ.

Proof
We relabel the nodes of G so that the natural ordering on {1, . . . , p} is a topological order. Clearly, this induces a topological order over the nodes of G_ℓ, as well. Recall that we have equation (3); i.e., for each j,

X_j = b_j^T X_{1:(j−1)} + ǫ_j, where ǫ_j ⊥⊥ (X_1, . . . , X_{j−1}). (55)

For each j ∈ C_ℓ, we define

ǫ′_j := ǫ_j + ∑_{k < j, k ∉ C_ℓ} b_{kj} X_k.
0, implies (∑_ℓ a_ℓ)/(∑_ℓ b_ℓ) > ξ, we conclude from equation (63) and inequalities (64) and (65) that

max_{G̃ ⊇ G_0} {γ_Ω(G, G̃)} ≥ [∑_{ℓ=1}^{k″} (score_Ω(G_ℓ) − score_Ω(G̃_ℓ))] / [∑_{ℓ=1}^{k″} |H(G_ℓ, G̃_ℓ)|] ≥ min_{1 ≤ ℓ ≤ k} γ_Ω(G_ℓ).

Since this result holds uniformly over all G, we have γ_Ω(G) ≥ min_{1 ≤ ℓ ≤ k} γ_Ω(G_ℓ), as well.

D.6 Proof of Lemma 24
This proof is quite similar to the proof for the fully-observed case, so we only mention the high-level details here.

We write

σ_j² |f̃_{σ_j}(S) − f_{σ_j}(S)| = |(Γ̂_{jj} − Γ̂_{j,S} Γ̂_{SS}^{-1} Γ̂_{S,j}) − (Σ_{jj} − Σ_{j,S} Σ_{SS}^{-1} Σ_{S,j})| ≤ |Γ̂_{jj} − Σ_{jj}| + |Γ̂_{j,S} Γ̂_{SS}^{-1} Γ̂_{S,j} − Σ_{j,S} Σ_{SS}^{-1} Σ_{S,j}| =: |Γ̂_{jj} − Σ_{jj}| + A. (66)

The first term may be bounded directly using inequality (32) and a union bound over j:

P( max_j |Γ̂_{jj} − Σ_{jj}| ≥ cσ² √(log p / n) ) ≤ c′ exp(−c″ log p). (67)

To bound the second term, we use the following expansion:

A ≤ |Γ̂_{j,S}(Γ̂_{SS}^{-1} − Σ_{SS}^{-1})Γ̂_{S,j}| + |Γ̂_{j,S} Σ_{SS}^{-1} (Γ̂_{S,j} − Σ_{S,j})| + |(Γ̂_{j,S} − Σ_{j,S}) Σ_{SS}^{-1} Σ_{S,j}|
≤ |||Γ̂_{SS}^{-1} − Σ_{SS}^{-1}||| · ‖Γ̂_{S,j}‖_2² + |||Σ_{SS}^{-1}||| · (‖Γ̂_{S,j}‖_2 ‖Γ̂_{S,j} − Σ_{S,j}‖_2 + ‖Γ̂_{S,j} − Σ_{S,j}‖_2 ‖Σ_{S,j}‖_2).

As in the proof of Lemma 29 in Appendix E, we may obtain a bound of the form

P( |||Γ̂_{SS}^{-1} − Σ_{SS}^{-1}||| ≥ cσ² (√(d/n) + t) ) ≤ c exp(−c n t²),

by inverting the deviation condition (32). Furthermore,

‖Γ̂_{S,j} − Σ_{S,j}‖_2 ≤ |||Γ̂_{S′S′} − Σ_{S′S′}|||, where S′ := S ∪ {j},

which may in turn be bounded using the deviation condition (32). We also have

‖Γ̂_{S,j}‖_2 ≤ ‖Σ_{S,j}‖_2 + |||Γ̂_{S′S′} − Σ_{S′S′}|||.

Combining these results and taking a union bound over the 2^d choices for S and p choices for j, we arrive at a uniform bound of the form

P( A ≤ c′ σ² √(log p / n) ) ≥ 1 − c′ exp(−c″ log p).

Together with inequality (67) and the expansion (66), we then obtain the desired result.
Appendix E. Matrix concentration results
This Appendix contains matrix concentration results that are used to prove our technical lemmas. We use |||·||| to denote the spectral norm of a matrix.

Lemma 29
Suppose {x_i}_{i=1}^n ⊆ ℝ^p are i.i.d. sub-Gaussian vectors with parameter σ² and covariance Σ. Then for all t ≥ 0, we have

P( |||X^T X / n − Σ||| ≤ σ² · max{δ, δ²} ) ≥ 1 − 2 exp(−c n t²), (68)

where δ = c′ √(p/n) + c″ t. Furthermore, if X^T X / n is invertible and σ² |||Σ^{-1}||| · max{δ, δ²} ≤ 1/2, we have

P( |||(X^T X / n)^{-1} − Σ^{-1}||| ≤ 2σ² |||Σ^{-1}|||² · max{δ, δ²} ) ≥ 1 − 2 exp(−c n t²). (69)

Proof
For inequality (68), see Remark 5.40 of Vershynin (2012). For inequality (69), we use the matrix expansion

(A + Δ)^{-1} = (A(I + A^{-1}Δ))^{-1} = (I + A^{-1}Δ)^{-1} A^{-1} = A^{-1} + ∑_{k=1}^∞ (−1)^k (A^{-1}Δ)^k A^{-1},

valid for any matrices A and Δ such that A and A + Δ are both invertible and the series converges. By the triangle inequality and submultiplicativity of the spectral norm, we then have

|||(A + Δ)^{-1} − A^{-1}||| ≤ ∑_{k=1}^∞ |||(A^{-1}Δ)^k A^{-1}||| ≤ |||A^{-1}||| · ∑_{k=1}^∞ |||A^{-1}Δ|||^k = |||A^{-1}||| · |||A^{-1}Δ||| / (1 − |||A^{-1}Δ|||) ≤ |||A^{-1}|||² · |||Δ||| / (1 − |||A^{-1}Δ|||).

We now take A = Σ and Δ = X^T X / n − Σ. By the assumption and inequality (68), we have

|||A^{-1}Δ||| ≤ |||A^{-1}||| · |||Δ||| ≤ 1/2,

implying that

|||(A + Δ)^{-1} − A^{-1}||| ≤ 2 |||A^{-1}|||² · |||Δ|||.

This gives the result.
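The perturbation bound obtained in this proof can be illustrated numerically; the sketch below (with an assumed well-conditioned covariance and a small symmetric perturbation of our own choosing) verifies that the bound holds once the condition |||A^{-1}Δ||| ≤ 1/2 is met.

```python
# Numerical illustration of the perturbation bound from the proof of Lemma 29:
# if |||A^{-1} D||| <= 1/2, then |||(A + D)^{-1} - A^{-1}||| <= 2 |||A^{-1}|||^2 |||D|||.
import numpy as np

rng = np.random.default_rng(3)
p = 6
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)                 # well-conditioned positive definite "Sigma"

D = rng.standard_normal((p, p))
D = 0.05 * (D + D.T)                        # small symmetric perturbation
A_inv = np.linalg.inv(A)

assert np.linalg.norm(A_inv @ D, 2) <= 0.5  # condition of the lemma
lhs = np.linalg.norm(np.linalg.inv(A + D) - A_inv, 2)
rhs = 2 * np.linalg.norm(A_inv, 2) ** 2 * np.linalg.norm(D, 2)
print(lhs <= rhs)                           # True under the stated condition
```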
References
O.O. Aalen, K. Røysland, J.M. Gran, and B. Ledergerber. Causality, mediation and time: A dynamic viewpoint. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175(4):831–861, 2012.

P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search, and penalized regression. arXiv e-prints, October 2013. Available at http://arxiv.org/abs/1310.1533.

D.M. Chickering. A transformational characterization of equivalent Bayesian network structures. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 87–98, 1995.

D.M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990.

T.R. Hughes, M.J. Marton, A.R. Jones, C.J. Roberts, R. Stoughton, C.D. Armour, H.A. Bennett, E. Coffey, H. Dai, Y.D. He, et al. Functional discovery via a compendium of expression profiles. Cell, 102:109–126, 2000.

T. Kloks. Treewidth: Computations and Approximations, volume 842. Springer, 1994.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

J.H. Korhonen and P. Parviainen. Exact learning of bounded tree-width Bayesian networks. In Artificial Intelligence and Statistics (AISTATS 2013), pages 370–378. JMLR, 2013.

S.L. Lauritzen. Graphical Models. Oxford University Press, 1996.

P. Loh and M.J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40(3):1637–1664, 2012.

P. Loh and M.J. Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. Annals of Statistics, 2013. To appear.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.

C. Nowzohour and P. Bühlmann. Score-based causal learning in additive noise models. 2013. In preparation.

S. Ordyniak and S. Szeider. Algorithms and complexity results for exact Bayesian structure learning. CoRR, abs/1203.3501, 2012.

E. Perrier, S. Imoto, and S. Miyano. Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, 9(2):2251–2286, 2008.

J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. arXiv e-prints, August 2013. Available at http://arxiv.org/abs/1205.2536. To appear in Biometrika.

P. Ravikumar, M.J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ_1-penalized log-determinant divergence. Electronic Journal of Statistics, 4:935–980, 2011.

S. Shimizu, P.O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.

S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P.O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248, 2011.

A. Shojaie and G. Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, 2010.

T. Silander and P. Myllymaki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 445–452, Arlington, VA, 2006. AUAI Press.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, volume 81. The MIT Press, 2000.

D.J. Stekhoven, I. Moraes, G. Sveinbjörnsson, L. Hennig, M.H. Maathuis, and P. Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.

S. van de Geer and P. Bühlmann. ℓ_0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y.C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.