Learning DAGs without imposing acyclicity
Gherardo Varando∗

Abstract
We explore whether it is possible to learn a directed acyclic graph (DAG) from data without explicitly imposing the acyclicity constraint. In particular, for Gaussian distributions, we frame structural learning as a sparse matrix factorization problem and we empirically show that solving an ℓ1-penalized optimization yields good recovery of the true graph and, in general, almost-DAG graphs. Moreover, this approach is computationally efficient and is not affected by the explosion of combinatorial complexity that affects classical structural learning algorithms.

∗[email protected]

1 Introduction

A vast literature exists on learning DAGs from data. Algorithms are usually classified as constraint-based or score-based [Scutari et al., 2019]. Score-based methods optimize some score function over the space of DAGs, usually employing a heuristic such as greedy search. Constraint-based methods, such as the PC algorithm [Spirtes et al., 1993, Colombo and Maathuis, 2014], instead use conditional independence testing to prune edges from the graph and then apply sets of rules to direct some of the remaining edges, obtaining an estimate of the Markov equivalence class [Chickering, 1995]. Recently, Zheng et al. [2018, 2020] proposed the use of optimization techniques to estimate DAG structures, writing the acyclicity condition of a directed graph as a smooth constraint on the weighted adjacency matrix. All the methods available in the literature for structure recovery of DAGs impose the acyclicity condition in one way or another.

We propose instead to consider the estimation of DAGs as a general sparse matrix-factorization problem for the inverse covariance matrix arising from linear structural equation models [Drton, 2018, Spirtes, 1995, Richardson, 1997]. We estimate such a sparse matrix factorization by solving an ℓ1-penalized minus log-likelihood minimization with a straightforward proximal gradient method. The proposed method takes inspiration from optimization-based algorithms for structure recovery such as the graphical lasso [Friedman et al., 2007] and especially the method proposed in Varando and Hansen [2020], where covariance matrices are parametrized as solutions of, possibly sparse, Lyapunov equations. In Section 4 we perform a simulation study and observe that the proposed method is competitive with classical approaches from the literature on the recovery of the true graph, while being generally faster.

A Fortran implementation of the method, together with examples of its usage within R and Python, is available at github.com/gherardovarando/nodag.

2 Linear structural equation models

A linear structural equation model (SEM) with independent Gaussian noise is a statistical model for a p-dimensional random vector X defined as the solution of

    X = Λ^t X + ε,    (1)

where we assume that ε is p-dimensional zero-mean independent Gaussian noise and Λ ∈ R^{p×p}. If I − Λ is invertible, equation (1) implies the covariance parametrization [Drton, 2018]

    Var(X) = (I − Λ)^{−t} Ω (I − Λ)^{−1},    (2)

where Ω = Var(ε) is a diagonal positive definite matrix.
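For illustration, the parametrization (2) can be checked numerically in a few lines of Python (a minimal sketch, not part of the nodag code; Λ, Ω and all constants are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
p = 4

# Arbitrary example: strictly upper-triangular Lambda, so the graph is a DAG
Lambda = np.triu(rng.uniform(-1.0, 1.0, (p, p)), k=1)
Omega = np.diag(rng.uniform(0.5, 1.5, p))  # Var(eps), diagonal positive definite

# Covariance implied by (2): (I - Lambda)^{-t} Omega (I - Lambda)^{-1}
B = np.linalg.inv(np.eye(p) - Lambda)
Sigma = B.T @ Omega @ B

# Monte Carlo check: X = Lambda^t X + eps  <=>  X (as rows) = eps @ (I - Lambda)^{-1}
n = 200_000
eps = rng.normal(0.0, np.sqrt(np.diag(Omega)), (n, p))
X = eps @ B
print(np.abs(np.cov(X, rowvar=False) - Sigma).max())  # close to 0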
The connection between Gaussian Bayesian networks and the system of equations (1) is immediate if we assume that the matrix Λ has a sparsity structure compatible with a given directed acyclic graph G. Inverting equation (2) we obtain the factorization of the inverse covariance matrix as

    Var(X)^{−1} = (I − Λ) Ω^{−1} (I − Λ)^t = AA^t,    (3)

with A = (I − Λ) Ω^{−1/2} having the same off-diagonal sparsity pattern as Λ.

Parametrizing the multivariate Gaussian distribution by its inverse covariance matrix, we can define the linear structural equation model with independent Gaussian noise, and associated graph G, as the family of normal distributions N(µ, Σ) with Σ^{−1} in

    M_G = { AA^t  s.t.  A ∈ R^{p×p} invertible and A_ij = 0 if i ↛ j in G }.

In particular, when the graph G is acyclic the set of inverse covariance matrices M_G corresponds to Gaussian Bayesian network models.

2.1 Markov equivalence

It is known that in general it is not possible to completely recover the graph G from observational data, since different graphs can give rise to the same statistical model, in which case we say that the two graphs are equivalent [Heckerman et al., 1994]. The equivalence classes obtained by considering the quotient of the space of directed graphs with respect to this equivalence are called Markov equivalence classes. In particular, if the graph G is a DAG, its Markov equivalence class consists of all DAGs having the same skeleton and exactly the same v-structures [Andersson et al., 1997, Heckerman et al., 1994]. A completed partially directed acyclic graph (CPDAG) can be used to represent the Markov equivalence class of a DAG [Andersson et al., 1997]. A CPDAG is a partially directed graph where directed edges represent edges that have the same direction in all DAGs belonging to the Markov equivalence class, while undirected edges are drawn where there exist DAGs in the Markov equivalence class with different directions for the given edge.

3 Method

Considering the factorization of the precision matrix (3), it is immediate to consider the following minimization problem for the ℓ1-penalized minus log-likelihood:

    minimize   − log det(AA^t) + trace(A^t R̂ A) + λ ||A||_1
    subject to A ∈ R^{p×p} invertible,    (4)

where R̂ is the empirical correlation matrix estimated from the data. By solving the above problem we estimate a sparse factorization of the inverse covariance matrix, and the estimated graph can be read directly from the non-zero entries of the minimizer.
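For concreteness, the objective of problem (4) can be written compactly in Python (a minimal sketch; the function below is illustrative and not part of the nodag package):

import numpy as np

def objective(A, R_hat, lam):
    # l1-penalized minus log-likelihood of problem (4),
    # with Var(X)^{-1} = A A^t and R_hat the empirical correlation matrix
    sign, logabsdet = np.linalg.slogdet(A)
    if sign == 0:
        return np.inf  # A must be invertible
    # -log det(A A^t) = -2 log|det A|
    return -2.0 * logabsdet + np.trace(A.T @ R_hat @ A) + lam * np.abs(A).sum()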
Remark 1. The proposed structural estimation is a general estimation procedure for linear SEMs, including models with cycles. We focus on the recovery of DAGs mainly because the literature on learning acyclic graphs is more extensive and more methods are available.
Various methods can be used to solve problem (4); we present here a very simple approach using a proximal gradient method [Bach et al., 2012, Parikh and Boyd, 2014], similar to the algorithm proposed in Varando and Hansen [2020].

The proximal gradient method is applicable to minimization problems where the objective function decomposes into a sum of two functions, one of which is differentiable. In particular, the objective function of problem (4) can be written as f(A) + g(A), where f(A) = − log det(AA^t) + trace(A^t R̂ A) and g(A) = λ ||A||_1, with f differentiable. The proximal gradient algorithm consists of iterations of the form

    A^{(k+1)} = S_{sλ}( A^{(k)} − s ∇f(A^{(k)}) ),

where the soft-thresholding S_t(A)_ij = sign(A_ij)(|A_ij| − t)_+ is the proximal operator for the ℓ1 penalization and ∇f(A) = 2R̂A − 2A^{−t}. At each iteration the step size s is selected using the line search proposed in Beck and Teboulle [2010].

A complete description of the algorithm and its implementation is given in Appendix A.
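The iteration is short to write down once the soft-thresholding operator is available. The Python sketch below uses a fixed step size s for simplicity, whereas the actual algorithm selects s by the Beck and Teboulle [2010] line search (see Appendix A); the step size and iteration count are illustrative:

import numpy as np

def soft_threshold(A, t):
    # proximal operator of t*||.||_1, applied entrywise
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def proximal_gradient_fixed_step(R_hat, lam, s=1e-2, n_iter=5000, A0=None):
    # fixed-step variant of the iterations above (illustrative only);
    # assumes A stays invertible along the iterations
    A = np.eye(R_hat.shape[0]) if A0 is None else A0.copy()
    for _ in range(n_iter):
        grad = 2.0 * R_hat @ A - 2.0 * np.linalg.inv(A).T  # gradient of f
        A = soft_threshold(A - s * grad, s * lam)
    return A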
Example 1. We simulate synthetic observations from a Gaussian Bayesian network with associated DAG given in Figure 1(a). The graph estimated by solving the optimization problem (4) is shown in Figure 1(b). We can observe that all the v-structures in the true DAG are correctly recovered. Nevertheless, the estimated graph is not a valid DAG, since it contains a directed cycle, a double edge i ↔ j between two of the nodes.

4 Simulation study

We perform a simulation study to explore how the proposed method behaves with respect to the recovery of the true graph. Data are generated from Gaussian Bayesian networks with known structure, similarly to Colombo and Maathuis [2014]. In particular, random DAGs with p ∈ { , , , , } nodes are generated, with each edge present independently with probability k/p, for k = 1, 2, 3,
4. Coefficients for the linear regression of each variable on its parents are independent realizations of a uniform distribution between 0.… and …. For each combination of p, k and noise distribution we generate 20 DAGs and subsequently sample n = 100, 1000, 10000 observations. We apply the proposed method (nodag) by solving the optimization problem (4) with λ = 0.1, 0.2, 0.3.
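The random-DAG generation can be sketched as follows (a Python illustration; the edge probability k/p is as described above, while the coefficient bounds and random sign flips are placeholders, since the exact values are only partially given here):

import numpy as np

def random_dag_weights(p, k, low=0.1, high=1.0, rng=None):
    # Strictly upper-triangular weight matrix: edge i -> j (i < j) is present
    # independently with probability k/p; `low`/`high` are placeholder bounds.
    rng = np.random.default_rng() if rng is None else rng
    mask = np.triu(rng.random((p, p)) < k / p, k=1)
    coef = rng.uniform(low, high, (p, p)) * rng.choice([-1.0, 1.0], (p, p))
    return np.where(mask, coef, 0.0)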
For comparison, we consider three classical structural recovery algorithms: the order-independent PC [Colombo and Maathuis, 2014], greedy equivalence search [Chickering, 2003], and a hill-climbing search. The PC algorithm (pc) and greedy equivalence search (ges) are implemented in the pcalg R package [Kalisch et al., 2012], while the hill-climbing search with tabu moves (tabu) is available in the bnlearn R package [Scutari, 2010].
For the pc method we use the Gaussian conditional independence test via Fisher's Z with significance levels 0.01, 0.005 and 0.001. The ges and tabu methods optimize the Bayesian information criterion, as per the defaults of their implementations. For tabu we also fix the maximum cardinality of the parent sets to 10 to limit the computational complexity.

The pc algorithm and greedy equivalence search estimate a representation of the Markov equivalence class, while the hill-climbing search and the proposed nodag method estimate a directed graph.
Similarly to Colombo and Maathuis [2014], we evaluate the estimated graphs using the F1 score (f1), false positive rate (fpr) and true positive rate (tpr) with respect to recovery of the true skeleton.
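These skeleton metrics can be computed, for instance, as follows (a Python sketch assuming graphs are represented by adjacency matrices; the helper names are ours):

import numpy as np

def skeleton(adj):
    # undirected skeleton of a (possibly weighted) adjacency matrix
    S = adj != 0
    return S | S.T

def skeleton_metrics(adj_est, adj_true):
    # F1 score, true positive rate and false positive rate for skeleton recovery;
    # each undirected edge is counted once via the strict upper triangle
    e = skeleton(adj_est)[np.triu_indices(adj_est.shape[0], k=1)]
    t = skeleton(adj_true)[np.triu_indices(adj_true.shape[0], k=1)]
    tp, fp = np.sum(e & t), np.sum(e & ~t)
    fn, tn = np.sum(~e & t), np.sum(~e & ~t)
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)
    tpr = tp / max(tp + fn, 1)
    fpr = fp / max(fp + tn, 1)
    return f1, tpr, fpr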
Figure 2 shows the average metrics for skeleton recovery for the different algorithms. We observe that the nodag method obtains results that are, in general, comparable to the algorithms from the literature, while performing clearly better in the small-sample regime with respect to skeleton recovery, and slightly worse for large sample sizes and small graphs.

Evaluating edge directions is trickier, since different DAGs with the same skeleton can be Markov equivalent (see Section 2.1), and moreover algorithms can output directed or partially directed graph estimates. We chose to report the structural Hamming distance to both the true DAG (shd-graph) and the true CPDAG (shd-cpdag) in Figure 3. We can see that the proposed method is on average superior to the other algorithms with respect to recovery of the true DAG. As for skeleton recovery, nodag performs worst in the large-sample, small-dimension setting. As we showed in Example 1, the graphs estimated by nodag sometimes have a double edge i ↔ j, and we have observed that this happens especially when the true CPDAG has the corresponding undirected edge i − j.

In general, the value of the penalization coefficient λ that yields the optimal results is, as expected, dependent on the sample size, but on average the results do not seem to be overly sensitive to it.

In Figure 4 we report the average execution times for the different methods. The nodag method outperforms all the other methods from a computational speed perspective, especially for large sample and system sizes.

5 Real-data application

As an example of a real-data application we consider the dataset from Sachs et al. [2005], widely used in the graphical models and causal discovery literature [Friedman et al., 2007, Meinshausen et al., 2016, Zheng et al., 2018, Varando and Hansen, 2020]. The data consist of observations of phosphorylated proteins and phospholipids (p = 11) from cells under different conditions (n = 7466). We estimate the graph representing the protein-signaling network with our nodag method with λ = 0.2 (Figure 5). The nodag method estimates a graph without cycles and with 12 edges, of which 4 are also present in the consensus network [Sachs et al., 2005]. Two estimated edges, jnk → pkc and mek → raf, appear with reversed direction in the consensus network, while others (jnk → p38, akt → erk, akt → mek) appear in graphs estimated by other methods in the literature [Meinshausen et al., 2016, Varando and Hansen, 2020].

Figure 5: Graph estimated as the support of A∗ from solving the optimization problem (4) with λ = 0.2. In blue, edges that are also present in the conventionally accepted network [Sachs et al., 2005].

6 Conclusions

We have framed the problem of learning DAGs within the larger class of linear structural equation models, and we have shown that a simple approach based on optimization techniques, without any acyclicity constraint, is able to obtain recovery performance similar to state-of-the-art algorithms; see Figures 2 and 3. The proposed nodag is also considerably faster than classical constraint-based and score-based methods (Figure 4).
By avoiding the acyclicity constraint we are able to use a standard proximal gradient algorithm over matrices A ∈ R^{p×p}, and thus the computational cost of the method depends only on the size of the system (p) and not on the sparsity level or the sample size. All simulations were performed on a standard laptop with 8 GB of RAM and an i5-8250U CPU.
Figure 2: Average F1 score, true positive rate and false positive rate for recovery of the skeleton. Different algorithms are shown in different colours; the proposed nodag method with solid lines and algorithms from the literature with dashed lines. Results for nodag-0.1 are not shown for n = 100 to improve readability.
Figure 3: Average structural Hamming distances with respect to the true CPDAG (shd-cpdag) and to the true DAG (shd-graph). Different algorithms in different colours; the proposed nodag method with solid lines and algorithms from the literature with dashed lines. The distance with respect to the empty graph is shown with black dotted lines. Results for nodag-0.1 are not shown for n = 100 to improve readability.
Figure 4: Average execution time for the different algorithms (colours); the proposed nodag method with solid lines and algorithms from the literature with dashed lines.

Moreover, the output of the nodag method parametrizes the inverse covariance matrix and thus provides an estimate of the parameters of the model, contrary to constraint-based algorithms. In particular, from the estimate of the matrix A it is possible to recover an estimate of the coefficient matrix Λ in equation (3).

The nodag method estimates linear SEMs without acyclicity constraints, and thus it would be interesting to empirically test its performance in the recovery of SEMs with cycles. As for the lasso [Friedman et al., 2010], the graphical lasso [Friedman et al., 2007] and other ℓ1-penalized methods [Varando and Hansen, 2020], it is straightforward to extend the method to estimate the regularization path for a sequence of decreasing λ values, and thus to perform data-driven selection of the regularization coefficient. The good computational complexity makes it feasible to combine the proposed algorithm with stability selection methods [Meinshausen and Bühlmann, 2010] and, by not imposing acyclicity, it allows simply averaging matrices estimated from bootstrapped or sub-sampled data.
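The regularization path with warm starts can be sketched as follows (the solve argument stands for any routine for problem (4) that accepts a starting point, e.g. the fixed-step sketch of Section 3; this helper is illustrative and not part of the nodag package):

import numpy as np

def regularization_path(R_hat, lambdas, solve):
    # warm-started path: solve problem (4) for decreasing lambda values,
    # starting each solve from the previous estimate
    A = np.eye(R_hat.shape[0])
    path = []
    for lam in sorted(lambdas, reverse=True):
        A = solve(R_hat, lam, A)
        path.append((lam, A.copy()))
    return path

# e.g. path = regularization_path(R_hat, [0.3, 0.2, 0.1],
#         lambda R, lam, A0: proximal_gradient_fixed_step(R, lam, A0=A0))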
Finally, from a strictly computational point of view, it would be interesting to explore ways to speed up both the computations and the convergence of Algorithm 1. The sparsity of the matrix A in Algorithm 1 could probably be exploited to speed up or avoid the computation of its LU decomposition, which would in turn decrease the cost per iteration. Accelerated gradient-like methods could be applied to improve the speed of convergence [Beck and Teboulle, 2009], thus reducing the number of iterations needed to converge.

The code and instructions to reproduce the examples, the simulation study and the real-world application are available at https://github.com/gherardovarando/nodag_experiments.

Acknowledgments
This work was supported by VILLUM FONDEN (grant 13358).
A Algorithm implementation
Algorithm 1 details the pseudocode of the proposed proximal method to solve problem (4). Each iteration of the algorithm consists essentially of a gradient computation followed by a line-search loop, where the gradient step and the proximal operator are applied for decreasingly small step sizes until the descent and the Beck and Teboulle [2010] conditions are met. The algorithm terminates when the difference between the objective function values of the last two iterations is less than a specified tolerance (ε) or when the maximum number of iterations (M) has been reached. The LU factorization of A, used to compute both the log-determinant and the inverse, is performed with the LAPACK implementation [Anderson et al., 1999] using partial pivoting.

The Fortran code implementing Algorithm 1 is available at github.com/gherardovarando/nodag.

Algorithm 1 Proximal gradient algorithm for minimization of the ℓ1-penalized minus log-likelihood

input: R̂ the empirical correlation matrix, M ∈ N, ε > 0, λ > 0, α ∈ (0, 1)
initialize k = 0, A = L = U = I, f = Σ_{i=1}^p R̂_ii, g = 0
repeat
  increase the iteration counter k = k + 1
  compute A^{−1} using the LU decomposition
  obtain the gradient D = 2R̂A − 2A^{−t}
  copy A into A′, f′ = f, g′ = g
  s = 1
  loop
    A = A′ − sD
    soft-threshold A at level sλ
    L, U = LU decomposition of A
    compute R̂A
    f = −2 Σ_{i=1}^p log |U_ii| + Σ_{i,j=1}^p A_ij (R̂A)_ij
    g = λ ||A||_1
    ν = Σ_{i,j=1}^p (A_ij − A′_ij)² / (2s) + (A_ij − A′_ij) D_ij
    if f ≤ f′ + ν and f + g ≤ f′ + g′ then
      break
    else
      s = αs
    end if
  end loop
  δ = f′ + g′ − f − g
until k > M or δ < ε
output: A, f, δ, k
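For reference, the following is a direct NumPy transcription of Algorithm 1 (a sketch only; the Fortran code above is the reference implementation, and we initialize g at its value for A = I rather than at 0):

import numpy as np

def nodag(R_hat, lam, max_iter=1000, eps=1e-8, alpha=0.5):
    # NumPy transcription of Algorithm 1 (illustrative)
    p = R_hat.shape[0]
    A = np.eye(p)
    f = np.trace(R_hat)   # f(I) = -log det(I I^t) + trace(R_hat)
    g = lam * p           # g(I) = lam * ||I||_1 (pseudocode initializes g = 0)
    for _ in range(max_iter):
        D = 2.0 * R_hat @ A - 2.0 * np.linalg.inv(A).T  # gradient of f
        A_old, f_old, g_old = A, f, g
        s = 1.0
        while True:  # line search over decreasing step sizes
            B = A_old - s * D
            A = np.sign(B) * np.maximum(np.abs(B) - s * lam, 0.0)  # soft threshold
            sign, logabsdet = np.linalg.slogdet(A)
            f = np.inf if sign == 0 else \
                -2.0 * logabsdet + np.trace(A.T @ R_hat @ A)
            g = lam * np.abs(A).sum()
            diff = A - A_old
            nu = (diff ** 2).sum() / (2.0 * s) + (diff * D).sum()
            if f <= f_old + nu and f + g <= f_old + g_old:
                break
            s *= alpha
        if f_old + g_old - f - g < eps:
            break
    return A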
References

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. Ann. Statist., 25(2):505–541, 1997.

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn., 4(1):1–106, 2012. ISSN 1935-8237.

A. Beck and M. Teboulle. Gradient-based algorithms with applications to signal recovery problems. In D. Palomar and Y. Eldar, editors, Convex Optimization in Signal Processing and Communications, pages 42–88. Cambridge University Press, 2010.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

David Maxwell Chickering. A transformational characterization of equivalent Bayesian network structures. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pages 87–98, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2003.

Diego Colombo and Marloes H. Maathuis. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15(116):3921–3962, 2014.

Mathias Drton. Algebraic problems in structural equation modeling. In The 50th Anniversary of Gröbner Bases, pages 35–86, Tokyo, Japan, 2018. Mathematical Society of Japan.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2007.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. In Ramon Lopez de Mantaras and David Poole, editors, Uncertainty Proceedings 1994, pages 293–301. Morgan Kaufmann, San Francisco, CA, 1994.

Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, and Peter Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012.

N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.

N. Meinshausen, A. Hauser, J. M. Mooij, J. Peters, P. Versteeg, and P. Bühlmann. Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, 113(27):7361–7368, 2016.

N. Parikh and S. Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127–239, 2014.

Thomas Richardson. A characterization of Markov equivalence for directed cyclic graphs. International Journal of Approximate Reasoning, 17(2):107–162, 1997. Uncertainty in AI (UAI'96) Conference.

K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.

Marco Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):1–22, 2010. doi: 10.18637/jss.v035.i03.

Marco Scutari, Catharina Elisabeth Graafland, and José Manuel Gutiérrez. Who learns better Bayesian network structures: Accuracy and speed of structure learning algorithms. International Journal of Approximate Reasoning, 115:235–253, 2019.

Peter Spirtes. Directed cyclic graphical representations of feedback models. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search, volume 81. 1993.

Gherardo Varando and Niels Richard Hansen. Graphical continuous Lyapunov models. Submitted, 2020.

Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, 2018.
Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, 2020.