Integer Programming for Learning Directed Acyclic Graphs from Continuous Data
Hasan Manzour∗  Simge Küçükyavuz†  Ali Shojaie‡

April 25, 2019
Abstract
Learning directed acyclic graphs (DAGs) from data is a challenging task both in theory and in practice, because the number of possible DAGs scales superexponentially with the number of nodes. In this paper, we study the problem of learning an optimal DAG from continuous observational data. We cast this problem in the form of a mathematical programming model which can naturally incorporate a super-structure in order to reduce the set of possible candidate DAGs. We use the penalized negative log-likelihood score function with both ℓ0 and ℓ1 regularizations and propose a new mixed-integer quadratic optimization (MIQO) model, referred to as a layered network (LN) formulation. The LN formulation is a compact model, which enjoys as tight an optimal continuous relaxation value as the stronger but larger formulations under a mild condition. Computational results indicate that the proposed formulation outperforms existing mathematical formulations and scales better than available algorithms that can solve the same problem with only ℓ1 regularization. In particular, the LN formulation clearly outperforms existing methods in terms of computational time needed to find an optimal DAG in the presence of a sparse super-structure.

1 Introduction

The study of Probabilistic Graphical Models (PGMs) is an essential topic in modern artificial intelligence [23]. A PGM is a rich framework that represents the joint probability distribution and dependency structure among a set of random variables in the form of a graph. Once learned from data or constructed from expert knowledge, PGMs can be utilized for probabilistic reasoning tasks, such as prediction; see [23, 24] for comprehensive reviews of PGMs. The two most common classes of PGMs are
Markov networks (undirected graphical models) and
Bayesian networks (directed graphical models). A Bayesian Network (BN) is a PGM in which the conditional probability relationships among random variables are represented in the form of a Directed Acyclic Graph (DAG). BNs use the richer language of directed graphs to model probabilistic influence among random variables that have clear directionality; they are particularly popular in practice, with applications in genetics [45], biology [27], machine learning [23], and causal inference [39].

∗ Department of Industrial and Systems Engineering, University of Washington ([email protected]).
† Department of Industrial Engineering and Management Sciences, Northwestern University ([email protected]).
‡ Department of Biostatistics, University of Washington ([email protected]).

Learning BNs is a central problem in machine learning. An essential part of this problem entails learning the DAG structure that accurately represents the (hypothetical) joint probability distribution of the BN. Although one can form the DAG based on expert knowledge, acquisition of knowledge from experts is often costly and nontrivial. Hence, there has been considerable interest in learning the DAG directly from observational data [3, 6, 9, 10, 17, 29, 34, 39]. Learning the DAG which best explains observed data is an NP-hard problem [5]. Despite this negative theoretical result, there has been interest in developing methods for learning DAGs in practice. There are two main approaches for learning DAGs from observational data: constraint-based and score-based. In constraint-based methods, such as the well-known PC-Algorithm [39], the goal is to learn a completed partially directed acyclic graph (CPDAG) consistent with conditional independence relations inferred from the data.
Score-based methods, including the approach in this paper, aim to find a DAG that maximizes a score that measures how well the DAG fits the data. Existing DAG learning algorithms can also be divided into methods for discrete and continuous data. Score-based methods for learning DAGs from discrete data typically involve a two-stage learning process. In stage 1, the score for each candidate parent set (CPS) of each node is computed. In stage 2, a search algorithm is used to maximize the global score, so that the resulting graph is acyclic. Both of these stages require exponential computation time [42]. For stage 2, there exist elegant exact algorithms based on dynamic programming [13, 21, 22, 30, 32, 36, 44], the A* algorithm [43, 44], and integer programming [2, 3, 7, 8, 9, 10]. The A* algorithm identifies the optimal DAG by solving a shortest path problem in an implicit state-space search graph, whereas integer programming (IP) directly casts the problem as a constrained optimization problem. Specifically, the variables in the IP model indicate whether or not a given parent set is assigned to a node in the network. Hence, the problem involves m·2^(m−1) binary variables for m nodes. To reduce the number of binary variables, a common practice is to limit the cardinality of each parent set [3, 9], which can lead to suboptimal solutions. A comprehensive empirical evaluation of the A* algorithm and IP methods for discrete data is conducted in [26]. The results show that the relative efficiency of these methods varies due to the intrinsic differences between them.
In particular, state-of-the-art IP methods can solve instances with up to 1,000 CPS per variable regardless of the number of nodes, whereas the A* algorithm works for problems with up to 30 nodes, even with tens of thousands of CPS per node. While statistical properties of DAG learning from continuous data have been extensively studied [15, 25, 34, 37, 41], the development of efficient computational tools for learning the optimal DAG from continuous data remains an open challenge. In addition, despite the existence of elegant exact search algorithms for discrete data, the literature on DAG learning from continuous data has primarily focused on approximate algorithms based on coordinate descent [14, 17] and non-convex continuous optimization [46]. To our knowledge, [29] and [42] provide the only exact algorithms for learning medium to large DAGs from continuous data. An IP-based model using the topological ordering of variables is proposed in [29]. An A*-lasso algorithm for learning an optimal DAG from continuous data with an ℓ1 regularization is developed in [42]. A*-lasso incorporates the lasso-based scoring method within dynamic programming to avoid an exponential number of parent sets and uses the A* algorithm to prune the search space of the dynamic programming method. Given the state of existing algorithms for DAG learning from continuous data, there is currently a gap between theory and computation: while statistical properties of exact algorithms can be rigorously analyzed [25, 41], it is much harder to assess the statistical properties of approximate algorithms [1, 14, 17, 46] that offer no optimality guarantees [21]. This gap becomes particularly noticeable in cases where the statistical model is identifiable from observational data. In this case, the optimal score from exact search algorithms is guaranteed to reveal the true underlying DAG when the sample size is large.
Therefore, causal structure learning from observational data becomes feasible [25, 33]. In this paper, we focus on DAG learning for an important class of BNs for continuous data, where causal relations among random variables are linear. More specifically, we consider DAGs corresponding to linear structural equation models (SEMs). In this case, network edges are associated with the coefficients of regression models corresponding to linear SEMs. Consequently, the score function can be explicitly encoded as a penalized negative log-likelihood function with an appropriate choice of regularization [29, 35, 41]. Hence, the process of computing scores (i.e., stage 1) is completely bypassed, and a single-stage model can be formulated [42]. Moreover, in this case, IP formulations only require a polynomial, rather than exponential, number of variables, because each variable can be defined in the space of arcs instead of parent sets. Therefore, cardinality constraints on the size of parent sets, which are used in earlier methods to reduce the search space and may lead to suboptimal solutions, are no longer necessary.

Contributions
In this paper, we develop tailored exact DAG learning methods for continuous data from linear SEMs, and make the following contributions:

– We develop a mathematical framework that can naturally incorporate prior structural knowledge, when available. Prior structural knowledge can be supplied in the form of an undirected and possibly cyclic graph (super-structure). An example is the skeleton of the DAG, obtained by removing the direction of all edges in a graph. Another example is the moral graph of the DAG, obtained by adding an edge between pairs of nodes with common children and removing the direction of all edges [39]. The skeleton and moral graphs are particularly important cases, because they can be consistently estimated from observational data under proper assumptions [20, 25]. Such prior information limits the number of possible DAGs and improves the computational performance.

– We discuss three mathematical formulations, namely, the cutting plane (CP), linear ordering (LO), and topological ordering (TO) formulations, for learning an optimal DAG, using both ℓ0- and ℓ1-penalized likelihood scores. We also propose a new mathematical formulation to learn an optimal DAG from continuous data, the layered network (LN) formulation, and establish that the other DAG learning formulations entail a smaller continuous relaxation feasible region compared to that of the continuous relaxation of the LN formulation (Propositions 3 and 4). Nonetheless, all formulations attain the same optimal continuous relaxation objective function value under a mild condition (Propositions 5 and 6). Notably, the number of binary variables and constraints in the LN formulation solely depends on the number of edges in the super-structure (e.g., moral graph). Thus, the performance of the LN formulation substantially improves in the presence of a sparse super-structure.
The LN formulation has a number of other advantages: it is a compact formulation, in contrast to the CP formulation; its relaxation can be solved much more efficiently than that of the LO formulation; and it requires fewer binary variables and explores fewer branch-and-bound nodes than the TO formulation. Our empirical results affirm the computational advantages of the LN formulation. They also demonstrate that the LN formulation can find a graph closer to the true underlying DAG. These improvements become more noticeable in the presence of a prior super-structure (e.g., moral graph).

– We compare the IP-based method and the A*-lasso algorithm for the case of ℓ1 regularization. As noted earlier, there is no clear winner among A*-style algorithms and IP-based models for DAG learning from discrete data [26]. Thus, a wide range of approaches based on dynamic programming, the A* algorithm, and IP-based models have been proposed for discrete data. In contrast, for DAG learning from continuous data with a complete super-structure, the LN formulation remains competitive with the state-of-the-art A*-lasso algorithm for small graphs, whereas it performs better for larger problems. Moreover, LN performs substantially better when a sparse super-structure is available. This is mainly because the LN formulation directly defines the variables based on the super-structure, whereas the A*-lasso algorithm cannot take advantage of the prior structural knowledge as effectively as the LN formulation.

In Section 2, we outline the necessary preliminaries, define the DAG structure learning problem, and present a general framework for the problem. Mathematical formulations for DAG learning are presented in Section 3. The strengths of the different optimization formulations are discussed in Section 4. Empirical results are presented in Sections 5 and 6. We end the paper with a summary and discussion of future research in Section 7.
The causal effect of (continuous) random variables in a DAG G can be described by SEMs that represent each variable as a (nonlinear) function of its parents. The general form of these models is given by [31]

X_k = f_k(pa_k^G, δ_k), k = 1, ..., m,  (1)

where X_k is the random variable associated with node k; pa_k^G denotes the parents of node k in G, i.e., the set of nodes with arcs pointing to node k; m is the number of nodes; and the latent random variables δ_k represent the unexplained variation in each node. An important class of SEMs is defined by linear functions f_k(·), which can be described by m linear regressions of the form

X_k = Σ_{j ∈ pa_k^G} β_jk X_j + δ_k, k = 1, ..., m,  (2)

where β_jk represents the effect of node j on k for j ∈ pa_k^G. In the special case where the random variables are Gaussian, Equations (1) and (2) are equivalent, in the sense that β_jk are coefficients of the linear regression model of X_k on X_j, j ∈ pa_k^G, and β_jk = 0 for j ∉ pa_k^G [35]. However, estimation procedures proposed in this paper are not limited to Gaussian random variables and apply more generally to linear SEMs [25, 35].

Let M = (V, E) be an undirected and possibly cyclic super-structure graph with node set V = {1, 2, ..., m} and edge set E ⊆ V × V. From M, generate a bi-directional graph M⃗ = (V, E⃗), where E⃗ = {(j, k), (k, j) | (j, k) ∈ E}. Throughout the paper, we refer to directed edges as arcs and to undirected edges as edges.

Consider n i.i.d. observations from the linear SEM (2). Let X = (X_1, ..., X_m) be the n × m data matrix with n rows representing i.i.d. samples, and m columns representing random variables. The linear SEM (2) can be compactly written as

X = XB + Δ,  (3)

where B = [β_jk] ∈ R^{m×m} is a matrix with β_kk = 0 for k = 1, ..., m and β_jk = 0 for all (j, k) ∉ E⃗, and Δ is the n × m noise matrix.
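The generative model (2)-(3) is straightforward to simulate, which is useful for building synthetic test instances. The following minimal sketch (an illustrative helper of ours, not part of the paper; standard library only) draws n samples with Gaussian noise by visiting nodes in a topological order of G(B):

```python
import random

def sample_linear_sem(B, n, noise_sd=1.0, seed=0):
    """Draw n i.i.d. samples from the linear SEM X = XB + Delta.

    B is an m x m list-of-lists with B[j][k] = beta_jk, nonzero only on
    the arcs of an acyclic graph G(B). Nodes are generated in a
    topological order, so each X_k depends only on already-sampled parents.
    """
    rng = random.Random(seed)
    m = len(B)
    # Build a topological order (exists because G(B) is assumed acyclic):
    # repeatedly pick a node all of whose parents are already placed.
    order, placed = [], set()
    while len(order) < m:
        for k in range(m):
            if k not in placed and all(B[j][k] == 0 or j in placed
                                       for j in range(m)):
                order.append(k)
                placed.add(k)
                break
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for k in order:  # X_k = sum_j beta_jk X_j + delta_k
            X[i][k] = (sum(B[j][k] * X[i][j] for j in range(m))
                       + rng.gauss(0.0, noise_sd))
    return X
```

For instance, with the single arc 1 → 2 and β_12 = 2, each sample satisfies X_2 = 2·X_1 + δ_2 exactly.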
More generally, B defines a directed graph G(B) on m nodes such that arc (j, k) appears in G(B) if and only if β_jk ≠ 0.

Let σ_k² denote the variance of δ_k, k = 1, ..., m. We assume that all noise variables have equal variances, i.e., σ_k² = σ². This condition implies the identifiability of DAGs from Gaussian [33] and non-Gaussian [25] data. Under this condition, the negative log-likelihood for linear SEMs with Gaussian or non-Gaussian noise is proportional to

l_n(β) = (1/2) tr{(I − B)(I − B)^T S},  (4)

where S = X^T X and I is the identity matrix [25].

In practice, we are often interested in learning sparse DAGs. Thus, a regularization term is used to obtain a sparse estimate. For the linear SEM (2), the optimization problem corresponding to the penalized negative log-likelihood with super-structure M (PNL_M) for learning sparse DAGs is given by

(PNL_M)  min_{B ∈ R^{m×m}}  l_n(β) + λφ(B),  (5a)
         s.t.  G(B) induces a DAG from M⃗.  (5b)

The objective function (5a) consists of two parts: the quadratic loss function l_n(β) from (4), and the regularization term φ(B). Popular choices for φ(B) include ℓ1-regularization or lasso [40], φ(B) = Σ_{(j,k)∈E⃗} |β_jk|, and ℓ0-regularization, φ(B) = Σ_{(j,k)∈E⃗} 1(β_jk), where 1(β_jk) = 1 if β_jk ≠ 0, and 0 otherwise. The tuning parameter λ controls the degree of regularization. The constraint (5b) stipulates that the resulting directed subgraph (digraph) has to be an induced DAG from M⃗.

When the super-structure M is a complete graph, PNL_M reduces to the classical PNL. In this case, the consistency of sparse PNL for DAG learning from Gaussian data with an ℓ0 penalty follows from an analysis similar to [41]. In particular, we have pr(Ĝ_n = G) → 1 as n → ∞, where Ĝ_n is the estimate of the true structure G.
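For intuition, the trace in (4) equals half the total squared residual of the m node-wise regressions, because the residual matrix is X − XB = X(I − B). A small sketch (an illustrative helper of ours, not from the paper) evaluates the score that way:

```python
def neg_log_likelihood(X, B):
    """Evaluate l_n(beta) = (1/2) tr{(I - B)(I - B)^T S} with S = X^T X.

    Computed equivalently as half the sum of squared residuals of the m
    node-wise regressions X_k ~ sum_j beta_jk X_j, since tr{(I-B)^T S (I-B)}
    is the squared Frobenius norm of X(I - B).
    """
    n, m = len(X), len(X[0])
    total = 0.0
    for k in range(m):
        for i in range(n):
            resid = X[i][k] - sum(B[j][k] * X[i][j] for j in range(m))
            total += resid * resid
    return 0.5 * total
```

Evaluating this for a candidate B and adding λ·φ(B) gives the PNL_M objective (5a) for that candidate.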
An important advantage of the PNL estimation problem is that its consistency does not require the (strong) faithfulness assumption [33, 41]. The mathematical model (5) incorporates a super-structure M (e.g., moral graph) into the PNL model. When M is the moral graph, the consistency of sparse PNL_M follows from the analysis in [25], which studies the consistency of the following two-stage framework for estimating sparse DAGs: (1) infer the moral graph from the support of the inverse covariance matrix; and (2) choose the best-scoring induced DAG from the moral graph. The authors investigate conditions for identifiability of the underlying DAG from observational data and establish the consistency of the two-stage framework. While PNL and PNL_M enjoy desirable statistical properties for linear SEMs with Gaussian [41] and non-Gaussian [25] noise, the computational challenges associated with these problems have not been fully addressed. This paper aims to bridge this gap.

Figure 1: A super-structure graph M (left) and a tournament (right) with six nodes which does not contain any cycles.

Prior to presenting mathematical formulations for solving PNL_M, we discuss a property of DAG learning from continuous data that distinguishes it from the corresponding problem for discrete data. To present this property in Proposition 1, we need a new definition.

Definition 1.
A tournament is a directed graph obtained by specifying a direction for each edge in the super-structure graph M (see Figure 1).

Proposition 1.
There exists an optimal solution G(B) to PNL_M (5) with an ℓ0 (or an ℓ1) regularization that is a cycle-free tournament.

All proofs are given in Appendix I. Proposition 1 implies that for DAG learning from continuous variables, the search space reduces to acyclic tournament structures. This is a far smaller search space when compared with the super-exponential O(m!·2^(m(m−1)/2)) search space of DAGs for discrete variables. However, one has to also identify the optimal β parameters. This search for optimal β parameters is critical, as it further reduces the super-structure to the edges of the DAG by removing the edges with zero β coefficients.

A solution method based on brute-force enumeration of all tournaments requires m!·γ computational time when M is complete, where γ denotes the computational time associated with solving PNL_M given a known tournament structure. This is because when M is complete, the total number of tournaments (equivalently, the total number of permutations) is m!. However, when M is incomplete, the number of DAGs is fewer than m!. The topological search space is m! regardless of the structure of M, and several topological orderings can correspond to the same DAG. The TO formulation [29] is based on this search space. In Section 3.2, we discuss a search space based on the layering of a DAG, which uniquely identifies a DAG, and propose the corresponding Layered Network (LN) formulation, which effectively utilizes the structure of M. We first discuss existing mathematical formulations for the PNL_M optimization problem (5) in the next section. Given the desirable statistical properties of ℓ0 regularization [41] and the fact that existing mathematical formulations are given for ℓ1 regularization, we present the formulations for ℓ0 regularization. We outline the necessary notation below.

Indices
V = {1, 2, ..., m}: index set of random variables
D = {1, 2, ..., n}: index set of samples

Input
M = (V, E): an undirected super-structure graph (e.g., the moral graph)
M⃗ = (V, E⃗): the bi-directional graph corresponding to the undirected graph M
X = (X_1, ..., X_m), where X_v = (x_1v, x_2v, ..., x_nv)^T and x_dv denotes the d-th sample (d ∈ D) of random variable X_v
λ: tuning parameter (penalty coefficient)

Continuous optimization variables
β_jk: weight of arc (j, k), representing the regression coefficients, ∀ (j, k) ∈ E⃗

Binary optimization variables
z_jk = 1 if arc (j, k) exists in a DAG; otherwise 0, ∀ (j, k) ∈ E⃗
g_jk = 1 if β_jk ≠ 0; otherwise 0, ∀ (j, k) ∈ E⃗

The main source of difficulty in solving PNL_M is the acyclicity of the DAG imposed by the constraint in (5b). A popular technique for ensuring acyclicity is to use cycle elimination constraints, which were first introduced in the context of the Traveling Salesman Problem (TSP) in [11]. Let C be the set of all possible cycles, let C_A ∈ C be the set of arcs defining a cycle, and define

F(β, g) := n^{-1} Σ_{k∈V} Σ_{d∈D} (x_dk − Σ_{(j,k)∈E⃗} β_jk x_dj)² + λ Σ_{(j,k)∈E⃗} g_jk.

Then, the ℓ0-PNL_M model can be formulated as

(ℓ0-CP)  min F(β, g),  (6a)
         −M g_jk ≤ β_jk ≤ M g_jk, ∀ (j, k) ∈ E⃗,  (6b)
         Σ_{(j,k)∈C_A} g_jk ≤ |C_A| − 1, ∀ C_A ∈ C,  (6c)
         g_jk ∈ {0, 1}, ∀ (j, k) ∈ E⃗.  (6d)

Following [29], the objective function (6a) is an expanded version of l_n(β) in PNL_M (multiplied by 2n^{-1}) with an ℓ0 regularization. The constraints in (6b) stipulate that β_jk ≠ 0 only if g_jk = 1, where M is a sufficiently large constant. The constraints in (6c) rule out all cycles. Note that for |C_A| = 2, the constraints in (6c) ensure that at most one arc exists between two nodes. The last set of constraints specifies the binary nature of the decision vector g.
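Since (6c) contains one inequality per cycle, implementations typically add these cuts lazily: given an incumbent support {(j, k) : g_jk = 1}, search for a directed cycle and add the corresponding violated inequality before resolving. A minimal sketch of this separation step (standard DFS; the helper name is ours, and the surrounding MIQP solver loop is omitted):

```python
def find_cycle(arcs, m):
    """Separation routine for the cycle constraints (6c): return the arc
    set of some directed cycle in the digraph induced by g, or None if
    the support is acyclic. 'arcs' is a set of (j, k) pairs with g_jk = 1;
    nodes are 0, ..., m - 1. Standard DFS with a three-color marking."""
    adj = {v: [] for v in range(m)}
    for j, k in arcs:
        adj[j].append(k)
    color = {v: 0 for v in range(m)}   # 0 = new, 1 = on stack, 2 = done
    parent = {}

    def dfs(v):
        color[v] = 1
        for w in adj[v]:
            if color[w] == 1:          # back arc closes a cycle w -> ... -> v -> w
                cyc, u = [(v, w)], v
                while u != w:
                    cyc.append((parent[u], u))
                    u = parent[u]
                return cyc
            if color[w] == 0:
                parent[w] = v
                found = dfs(w)
                if found:
                    return found
        color[v] = 2
        return None

    for v in range(m):
        if color[v] == 0:
            found = dfs(v)
            if found:
                return found
    return None
```

Each cycle C_A returned this way yields the violated cut Σ_{(j,k)∈C_A} g_jk ≤ |C_A| − 1, which is then added to the model.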
Note that β variables are continuous and unrestricted; however, in typical applications, they can be bounded by a finite number M. This formulation requires |E⃗| binary variables and an exponential number of constraints. A cutting plane method [28] that adds the cycle elimination inequalities as needed is often used to solve this problem. We refer to this formulation as the cutting plane (CP) formulation.

Remark 1. For a complete super-structure M, it suffices to impose the set of constraints in (6c) only for cycles of size 2 and 3, given by

g_ij + g_jk + g_ki ≤ 2, ∀ i, j, k ∈ V, i ≠ j ≠ k,
g_jk + g_kj ≤ 1, ∀ j, k ∈ V, j ≠ k.

In other words, the CP formulation (both with ℓ0 and ℓ1 regularizations) needs only a polynomial number of constraints for a complete super-structure M.

The second formulation is based on a well-known combinatorial optimization problem, known as linear ordering (LO) [16]. Given a finite set S with q elements, a linear ordering of S is a permutation P ∈ S_q, where S_q denotes the set of all permutations of q elements. In the LO problem, the goal is to identify the best permutation among m nodes. The "cost" of a permutation P depends on the order of the elements in a pairwise fashion. Let p_j denote the order of node j ∈ V in permutation P. Then, for two nodes j, k ∈ {1, ..., m}, the cost is c_jk if the order of node j precedes the order of node k (p_j ≺ p_k) and is c_kj otherwise (p_j ≻ p_k). A binary variable w_jk indicates whether p_j ≺ p_k. Because w_jk + w_kj = 1 and w_jj = 0, one only needs m(m−1)/2 variables to cast the LO problem as an IP formulation [16].

The LO formulation for DAG learning from continuous data has two noticeable differences compared with the classical LO problem: (i) the objective function is quadratic, and (ii) an additional set of continuous variables, i.e., the βs, is added. Cycles are ruled out by directly imposing the linear ordering constraints.
The PNL_M can be formulated as (8):

(ℓ0-LO)  min F(β, g),  (8a)
         −M g_jk ≤ β_jk ≤ M g_jk, ∀ (j, k) ∈ E⃗,  (8b)
         g_jk ≤ w_jk, ∀ (j, k) ∈ E⃗,  (8c)
         w_jk + w_kj = 1, ∀ j, k ∈ V, j ≠ k,  (8d)
         w_ij + w_jk + w_ki ≤ 2, ∀ i, j, k ∈ V, i ≠ j ≠ k,  (8e)
         w_jk ∈ {0, 1}, ∀ j, k ∈ V, j ≠ k,  (8f)
         g_jk ∈ {0, 1}, ∀ (j, k) ∈ E⃗.  (8g)

The interpretation of constraints (8b)-(8d) is straightforward. The constraints in (8c) imply that if node j appears after node k in a linear ordering (w_jk = 0), then there cannot be an arc from j to k (g_jk = 0). The set of inequalities (8e) implies that if p_i ≺ p_j and p_j ≺ p_k, then p_i ≺ p_k. This ensures the linear ordering of nodes and removes cycles.

The third approach for ruling out cycles is to impose a set of constraints such that the nodes follow a topological ordering. A topological ordering is a linear ordering of the nodes of a graph such that the graph contains an arc (j, k) only if node j appears before node k in the linear order. Define decision variables o_rs ∈ {0, 1} for all r, s ∈ {1, ..., m}. This variable takes value 1 if the topological order of node r (i.e., p_r) equals s, and 0 otherwise. If a topological ordering is known, the DAG structure can be efficiently learned in polynomial time [35], but the problem remains challenging when the ordering is not known. The topological ordering prevents cycles in the graph. This property is used in [29] to model the problem of learning a DAG with ℓ1 regularization. We extend
their formulation to ℓ0 regularization.

Figure 2: The role of the binary decision variables in ℓ0 regularization: using z (instead of g) in the objective function creates a graph similar to (b) and counts the number of arcs instead of the number of non-zero βs in (b).

The topological ordering (TO) formulation is given by

(ℓ0-TO)  min F(β, g),  (9a)
         −M g_jk ≤ β_jk ≤ M g_jk, ∀ (j, k) ∈ E⃗,  (9b)
         g_jk ≤ z_jk, ∀ (j, k) ∈ E⃗,  (9c)
         z_jk + z_kj ≤ 1, ∀ (j, k) ∈ E⃗,  (9d)
         z_jk − m z_kj ≤ Σ_{s∈V} s (o_ks − o_js), ∀ (j, k) ∈ E⃗,  (9e)
         Σ_{s∈V} o_rs = 1, ∀ r ∈ V,  (9f)
         Σ_{r∈V} o_rs = 1, ∀ s ∈ V,  (9g)
         z_jk ∈ {0, 1}, ∀ (j, k) ∈ E⃗,  (9h)
         o_rs ∈ {0, 1}, ∀ r, s ∈ {1, 2, ..., m},  (9i)
         g_jk ∈ {0, 1}, ∀ (j, k) ∈ E⃗.  (9j)

In this formulation, z_jk is an auxiliary binary variable which takes value 1 if an arc exists from node j to node k. Recall that g_jk = 1 if |β_jk| >
0. The constraints in (9c) enforce the correct link between g_jk and z_jk, i.e., g_jk has to take value zero if z_jk = 0. The constraints in (9d) imply that there cannot be a bi-directional arc between two nodes. This inequality can be replaced with an equality (Corollary 1). The constraints in (9e) remove cycles by imposing an ordering among the nodes. The set of constraints in (9f)-(9g) assigns a unique topological order to each node. The last two sets of constraints indicate the binary nature of the decision variables o and z.

Corollary 1, which is a direct consequence of Proposition 1, implies that we can use z_jk = 1 − z_kj for all j < k and reduce the number of binary variables.

Corollary 1.
The constraints in (9d) can be replaced by z_jk + z_kj = 1, ∀ (j, k) ∈ E⃗.

In constraints (9b)-(9i), both variables z and g are needed to correctly model the ℓ0 regularization term in the objective function (see Figure 2a). This is because the constraints (9e) satisfy the transitivity property: if z_ij = 1 for (i, j) ∈ E⃗ and z_jk = 1 for (j, k) ∈ E⃗, then z_ik = 1 for (i, k) ∈ E⃗, since z_ij = 1 implies that Σ_{s∈V} s(o_js − o_is) ≥ 1 and, similarly, z_jk = 1 implies Σ_{s∈V} s(o_ks − o_js) ≥ 1. Summing both inequalities, we have Σ_{s∈V} s(o_ks − o_is) ≥ 2, which enforces z_ik = 1. Such a transitivity relation, however, need not hold for the decision vector g. In other words, the decision variable g_jk is used to keep track of the number of non-zero weights β_jk associated with the arcs (j, k), whereas the decision vector z is used to remove cycles via the set of constraints in (9e) by creating an acyclic tournament on the super-structure M. A tournament on super-structure M assigns a direction to each edge of the undirected super-structure M. If we were instead to let g_ij = z_ij for (i, j) ∈ E⃗ and hence use the decision variable z in the objective, then we would be counting the number of edges, equal to |E|, instead of the number of non-zero β values.

As an alternative to the existing mathematical formulations, we propose a new formulation for imposing acyclicity constraints that is motivated by the layering of nodes in DAGs [18]. More specifically, our formulation ensures that the resulting graph is a layered network, in the sense that there exists no arc from a layer v to a layer u with u < v. Let ψ_k be the layer value for node k. One may interpret ψ_k as Σ_{s=1}^m s·o_ks for all k ∈ V, where the variables o_ks are as defined in the TO formulation. However, note that the notion of ψ_k is more general, because ψ_k need not be integer. Figure 3 depicts the layered network encoding of a DAG. With this notation, our layered network (LN) formulation can be written as

(LN)  min F(β, g),  (10a)
      −M g_jk ≤ β_jk ≤ M g_jk, ∀ (j, k) ∈ E⃗,  (10b)
      g_jk ≤ z_jk, ∀ (j, k) ∈ E⃗,  (10c)
      z_jk + z_kj = 1, ∀ (j, k) ∈ E⃗,  (10d)
      z_jk − (m − 1) z_kj ≤ ψ_k − ψ_j, ∀ (j, k) ∈ E⃗,  (10e)
      z_jk ∈ {0, 1}, ∀ (j, k) ∈ E⃗,  (10f)
      1 ≤ ψ_k ≤ m, ∀ k ∈ V,  (10g)
      g_jk ∈ {0, 1}, ∀ (j, k) ∈ E⃗.  (10h)

The interpretation of the constraints (10b)-(10c) is straightforward. The constraints in (10e) ensure that the graph is a layered network.
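As a sanity check on (10d)-(10e), the following sketch (a hypothetical helper of ours, not from the paper) verifies these two constraint families for a candidate orientation z and layer values ψ; any pair passing the check induces an acyclic graph, because every selected arc must go from a strictly smaller to a strictly larger layer value:

```python
def ln_feasible(edges, z, psi, m):
    """Check the layered-network constraints (10d)-(10e):
    z_jk + z_kj = 1 and z_jk - (m - 1) z_kj <= psi_k - psi_j for every
    edge {j, k} of the super-structure (z is a dict keyed by both
    directed copies of each edge, psi a dict of real layer values)."""
    for (j, k) in edges:
        if z[(j, k)] + z[(k, j)] != 1:          # exactly one orientation
            return False
        for (a, b) in ((j, k), (k, j)):          # (10e) for both directions
            if z[(a, b)] - (m - 1) * z[(b, a)] > psi[b] - psi[a] + 1e-9:
                return False
    return True
```

For the triangle super-structure on nodes {0, 1, 2} with ψ = (1, 2, 3), the acyclic orientation 0→1, 1→2, 0→2 passes, whereas the cyclic orientation 0→1, 1→2, 2→0 fails: arc 2→0 would require ψ_0 − ψ_2 ≥ 1.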
The last set of constraints indicates the continuous nature of the decision variable ψ and gives the tightest valid bound for ψ. It suffices to consider any real numbers for the layer values ψ, as long as the layer values of any two nodes differ by at least one whenever an arc connects them.

Figure 3: The layered network encoding of a DAG.

The next proposition establishes the validity of the LN formulation.

Proposition 2.
An optimal solution to (10) is an optimal solution to (5).

The LN formulation highlights a desirable property of the layered network representation of a DAG in comparison to the topological ordering representation. Let us define the nodes in layer 1 as the set of nodes that have no incoming arcs in the DAG, the nodes in layer 2 as the set of nodes that have incoming arcs only from nodes in layer 1, layer 3 as the set of nodes that have incoming arcs from layer 2 (and possibly layer 1), and so on (see Figure 3). The minimal layer number of a node is the length of the longest directed path from any node in layer 1 to that node. For a given DAG, there is a unique minimal layer numbering, but not a unique topological order. As an example, Figure 2a has three valid topological orders: (i) 1,2,4,3; (ii) 1,4,2,3; and (iii) 4,1,2,3. In contrast, it has a unique layer representation, 1,2,3,1.

There is a one-to-one correspondence between minimal layer numberings and DAGs. However, the solutions of the LN formulation, i.e., the ψ variables (layer values), do not necessarily correspond to the minimal layer numbering. This is because the LN formulation does not impose additional constraints to enforce a minimal numbering and can output solutions that are not minimally numbered. However, because branch-and-bound does not branch on continuous variables, alternative (non-minimal) feasible solutions for the ψ variables do not impact the branch-and-bound process. On the contrary, we have multiple possible representations of the same DAG with topological ordering. Because the topological ordering variables are binary, the branch-and-bound method applied to the TO formulation explores multiple identical DAGs as it branches on the topological ordering variables.
This enlarges the size of the branch-and-bound tree and increases the computational burden.

The layered network representation also has an important practical implication: using this representation, the search space can be reduced to the total number of ways we can layer a network (equivalently, the total number of possible minimal layer numberings) instead of the total number of topological orderings. When the super-structure M is complete, both quantities are the same and equal to m!. Otherwise, a brute-force search for finding the optimal DAG has computational time L·γ, where L denotes the total number of minimal layer numberings, and γ is the computational complexity of solving PNL_M given a known tournament structure.

Table 1: The number of binary variables and the number of constraints ("Exp." denotes exponentially many).

                          Incomplete (moral) M                                    Complete (moral) M
                 CP      LO               TO             LN           CP          LO           TO               LN
Binary (ℓ0)      |E⃗|    |E⃗|+m(m−1)/2    m²+|E|+|E⃗|    |E|+|E⃗|     m(m−1)      3m(m−1)/2    m²+3m(m−1)/2     3m(m−1)/2
Binary (ℓ1)      |E|     m(m−1)/2         m²+|E|         |E|          m(m−1)/2    m(m−1)/2     m²+m(m−1)/2      m(m−1)/2
Constraints      Exp.    O(m³)            |E⃗|+2m        |E⃗|         O(m³)       O(m³)        m(m−1)+2m        m(m−1)
(ℓ0 and ℓ1)

We close this section by noting that a set of constraints similar to (10e) was introduced in [7] for learning pedigrees. However, the formulation in [7] requires an exponential number of variables. In more recent work on DAGs for discrete data, Cussens and colleagues have focused on a tighter formulation for removing cycles, known as cluster constraints [8, 9, 10]. To represent the set of cluster constraints, variables have to be defined according to the parent set choice, leading to an exponential number of variables. Thus, such a representation is not available in the space of arcs.
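To illustrate the gap between the two search spaces on a concrete example, the following brute-force sketch (helper names are ours; we assume the arcs 1→2, 2→3, 4→3, which match the topological orders and layer values quoted for Figure 2a above) counts the topological orders of a small DAG and computes its unique minimal layer numbering:

```python
from itertools import permutations

def topological_orders(nodes, arcs):
    """All topological orders of a small DAG, by brute-force enumeration:
    keep every permutation in which each arc's tail precedes its head."""
    def consistent(p):
        return all(p.index(j) < p.index(k) for (j, k) in arcs)
    return [p for p in permutations(nodes) if consistent(p)]

def minimal_layers(nodes, arcs):
    """Minimal layer number of each node, computed by iterating
    layer[k] = max(layer[k], layer[j] + 1) over arcs (j, k) until a
    fixpoint; this terminates because the graph is acyclic."""
    layer = {v: 1 for v in nodes}
    changed = True
    while changed:
        changed = False
        for (j, k) in arcs:
            if layer[k] < layer[j] + 1:
                layer[k] = layer[j] + 1
                changed = True
    return layer
```

On this instance the DAG admits three topological orders but exactly one minimal layer numbering, {1: 1, 2: 2, 3: 3, 4: 1}, matching the 1,2,3,1 representation discussed above.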
ℓ1 regularization

Because of its convexity, the structure learning literature has utilized ℓ1 regularization for learning DAGs from continuous variables [17, 29, 35, 42, 46]. The LN formulation with ℓ1 regularization can be written as

\[
\min \; \frac{1}{n}\sum_{i\in I}\sum_{k\in V}\Bigl(x_{ik}-\sum_{(j,k)\in \vec{E}}\beta_{jk}\,x_{ij}\Bigr)^{2}+\lambda\sum_{(j,k)\in \vec{E}}|\beta_{jk}|, \qquad (11a)
\]
\[
-M z_{jk}\le \beta_{jk}\le M z_{jk}\quad \forall (j,k)\in \vec{E}, \qquad (11b)
\]

together with (10e)-(10g).

Remark. For a complete super-structure M, $\psi_k=\sum_{j\in V\setminus\{k\}} z_{jk}$ for all k ∈ V. Thus, the LN formulations (both ℓ0 and ℓ1) can be encoded without the ψ variables by substituting these sums for ψ in (10e), which yields an equivalent constraint stated in the z variables alone for all j, k ∈ V, j ≠ k.

Remark. The CP and LO formulations reduce to the same formulation for ℓ0 regularization when the super-structure M is complete, by letting $w_{jk}=g_{jk}$ in formulation (8) for all (j,k) ∈ E→.

An advantage of ℓ1 regularization for DAG learning is that all models (CP, LO, TO, and LN) can be formulated without the decision variables g_jk, since counting the number of non-zero β_jk is no longer necessary.

Table 1 shows the number of binary variables and the number of constraints associated with cycle prevention constraints in each model. Evidently, the ℓ0 models require additional binary variables compared to the corresponding ℓ1 models. Note that the number of binary variables and constraints for the LN formulation depends solely on the number of edges in the super-structure M. This property is particularly desirable when the super-structure M is sparse.

Figure 4: Continuous relaxation regions of three IP models. The tightest model (convex hull) is represented by the black polygon strictly inside the other polygons. The blue and red polygons represent valid yet weaker formulations. The black points show the feasible integer points for all three formulations.
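For a fixed arc set and coefficients, objective (11a) is straightforward to evaluate. The sketch below (with hypothetical toy data, and β given as a dictionary keyed by arcs) illustrates the two terms: the averaged squared residuals of each node regressed on its candidate parents, and the ℓ1 penalty:

```python
def ln_l1_score(X, beta, lam):
    """Objective (11a): (1/n) * sum of squared residuals over all nodes,
    plus an l1 penalty on the arc coefficients. X is a list of samples
    (rows); beta maps an arc (j, k) to its coefficient beta_jk."""
    n = len(X)
    m = len(X[0])
    sse = 0.0
    for row in X:
        for k in range(m):
            pred = sum(b * row[j] for (j, kk), b in beta.items() if kk == k)
            sse += (row[k] - pred) ** 2
    return sse / n + lam * sum(abs(b) for b in beta.values())

# Hypothetical toy data generated from x1 = 0.5 * x0 (a single arc 0 -> 1).
X = [[1.0, 0.5], [2.0, 1.0], [-1.0, -0.5]]
beta = {(0, 1): 0.5}
print(ln_l1_score(X, beta, lam=0.1))  # 6/3 residual term + 0.05 penalty = 2.05
```

Note that, as in (11a), a parentless node contributes its own sum of squares to the residual term.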
The LN formulation requires the fewest constraints among all models. The LN formulation also requires fewer binary variables than the TO formulation. More importantly, different topological orders for the same DAG are symmetric solutions of the TO formulation. Consequently, branch-and-bound must explore multiple symmetric solutions as it branches on fractional TO variables. As for the LO formulation, the number of constraints is O(m³), which makes its continuous relaxation cumbersome to solve in the branch-and-bound process. The LN formulation is compact, whereas the CP formulation requires an exponential number of constraints for an incomplete super-structure M. The CP formulation requires fewer binary variables than LN for the ℓ0 formulation; both formulations need the fewest binary variables for ℓ1 regularization.

In the next section, we discuss the theoretical strength of these mathematical formulations and provide a key insight on why the LN formulation performs well for learning DAGs from continuous data.

One of the fundamental concepts in IP is relaxation, wherein some or all constraints of a problem are loosened. Relaxations are often used to obtain a sequence of easier problems that can be solved efficiently, yielding bounds and approximate, not necessarily feasible, solutions for the original problem. Continuous relaxation is a common relaxation obtained by relaxing the binary variables of the original mixed-integer quadratic program (MIQP) and allowing them to take real values. Continuous relaxation is at the heart of branch-and-bound methods for solving MIQPs. An important concept when comparing different MIQP formulations is the strength of their continuous relaxations.
Definition 2.
A formulation A is said to be stronger than formulation B if R(A) ⊂ R(B), where R(A) and R(B) correspond to the feasible regions of the continuous relaxations of A and B, respectively.

Proposition 3. The LO formulation is stronger than the LN formulation, that is, R(LO) ⊂ R(LN).

Proposition 4.
When the parameter m in (9e) is replaced with m − 1, the TO formulation is stronger than the LN formulation, that is, R(TO) ⊂ R(LN).

These propositions are somewhat expected because the LN formulation uses the fewest constraints. Hence, the feasible region of the continuous relaxation of the LN formulation is loosened compared to the other formulations. The next two results justify the advantages of the LN formulation.
Proposition 5.
Let $\beta^{\star}_{jk}$ denote the optimal coefficient associated with an arc (j,k) ∈ E→ from (5). For both ℓ0 and ℓ1 regularizations, the initial continuous relaxation of the LN formulation attains as tight an optimal objective function value as the LO, CP, and TO formulations if $M \ge \sum_{(j,k)\in \vec{E}} |\beta^{\star}_{jk}|$.

Proposition 5 states that although the LO and TO formulations are tighter than the LN formulation with respect to the feasible regions of their continuous relaxations, the continuous relaxations of all models attain the same objective function value (root relaxation).
Proposition 6.
For the same variable branching in the branch-and-bound process, the continuous relaxations of the LN formulation for both ℓ0 and ℓ1 regularizations attain as tight an optimal objective function value as LO, CP, and TO, if $M \ge \sum_{(j,k)\in \vec{E}} |\beta^{\star}_{jk}|$.

Proposition 6 is at the crux of this section. It shows that the tightness of the optimal objective function value of the continuous relaxation holds not only for the root relaxation, but throughout the branch-and-bound process under the specified condition on M, provided the same branching choices are made. Thus, the advantages of the LN formulation are due to the fact that it is a compact formulation that entails the fewest constraints, while attaining the same optimal continuous relaxation objective value as the tighter models.

In practice, finding a tight value for M is difficult. Our computational results show that the approach suggested in [29] to obtain a value of M, which is explained in Section 5 and used in our computational experiments, always satisfies the condition in Proposition 6 across all generated instances.

5 Numerical Experiments

We present numerical results comparing the proposed LN formulation with existing approaches. Experiments are performed on a cluster operating on UNIX with Intel Xeon E5-2640v4 2.4GHz processors. All MIQP formulations are implemented in the Python programming language, and Gurobi 8.0 is used as the MIQP solver. A time limit of 50m (in seconds), where m denotes the number of nodes, is imposed across all experiments, after which runs are aborted. Unless otherwise stated, an MIQP optimality gap of 0.001 is imposed across all experiments; the gap is calculated as (UB − LB)/UB, where UB denotes the objective value associated with the best feasible integer solution (incumbent) and LB represents the best lower bound obtained during the branch-and-bound process.

For CP, instead of incorporating all constraints given by (6c), we begin with no constraints of type (6c). Given an integer solution with cycles, we detect a cycle and impose a new cycle prevention constraint to remove it. Depth-first search (DFS) can detect a cycle in a directed graph with complexity O(|V| + |E|). Gurobi's lazy-callback facility is used, which allows adding cycle prevention constraints within the branch-and-bound algorithm whenever an integer solution with cycles is found. The same approach is used by [29]. Note that the Gurobi solver follows a branch-and-cut implementation and adds many general-purpose and special-purpose cutting planes.

To select the M parameter in all formulations, we use the proposal of [29]. Specifically, given λ, we solve each problem without cycle prevention constraints. We then use the upper bound M = 2 max_{(j,k)∈E→} |β_jk|. The results in [29] computationally confirm that this approach gives a large enough value of M. We also confirmed the validity of this choice across all our test instances.

5.1 Synthetic data

We use the R package pcalg to generate random Erdős-Rényi graphs. First, we create a DAG using the randomDAG function and assign random arc weights (i.e., β) from a uniform distribution U[0.1, 1]. This ground truth DAG is used to assess the quality of estimates. Next, the resulting DAG and random coefficients are input to the rmvDAG function, which uses linear regression as the underlying model, to generate multivariate data (columns of the matrix X) with the standard normal error distribution.

We consider m ∈ {10, 20, 30, 40} nodes and n ∈ {100, 1000} samples. The average outgoing degree of each node, denoted by d, is set to 2. We generate 10 random graphs for each setting (m, n, d).
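The separation routine for CP described above reduces to a standard DFS: interpret the integer solution z as an arc set, return one directed cycle C, and add the cut Σ_{(j,k)∈C} z_jk ≤ |C| − 1 as a lazy constraint. A self-contained sketch (the graph encoding and helper names are ours, not Gurobi's API):

```python
def find_cycle(nodes, arcs):
    """Return the arcs of one directed cycle, or None if the graph is
    acyclic. Iterative DFS with white/grey/black coloring, O(|V|+|E|)."""
    succ = {v: [] for v in nodes}
    for (u, v) in arcs:
        succ[u].append(v)
    color = {v: "white" for v in nodes}
    parent = {}
    for root in nodes:
        if color[root] != "white":
            continue
        stack = [(root, iter(succ[root]))]
        color[root] = "grey"
        while stack:
            v, it = stack[-1]
            w = next(it, None)
            if w is None:
                color[v] = "black"
                stack.pop()
            elif color[w] == "white":
                color[w] = "grey"
                parent[w] = v
                stack.append((w, iter(succ[w])))
            elif color[w] == "grey":  # back arc (v, w) closes a cycle
                cycle = [(v, w)]
                while v != w:
                    cycle.append((parent[v], v))
                    v = parent[v]
                return cycle
    return None

def cycle_cut(cycle):
    """Lazy cycle-prevention cut: sum of z over the cycle's arcs <= |C| - 1."""
    return (cycle, len(cycle) - 1)

arcs = [(1, 2), (2, 3), (3, 1), (3, 4)]
cyc = find_cycle([1, 2, 3, 4], arcs)
print(cyc, cycle_cut(cyc)[1])
```

In the actual implementation, the returned cycle would be turned into a lazy constraint inside the solver callback whenever an integer solution contains a cycle.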
The raw observational data X for the datasets with n = 100 are the same as the first 100 rows of the datasets with n = 1000.

We consider two types of problem instances: (i) a set of instances for which the moral graph corresponding to the true DAG is available; and (ii) a set of instances with a complete undirected graph, i.e., assuming no prior knowledge. The first class of problems is referred to as moral instances, whereas the second class is called complete instances. The raw observational data X for moral and complete instances are the same. The function moralize(graph) in the pcalg R package is used to generate the moral graph from the true DAG. The moral graph can also be (consistently) estimated from data using penalized estimation procedures with polynomial complexity [20, 25]. However, since the quality of the moral graph equally affects all optimization models, the true moral graph is used in our experiments.

We use the following IP-based metrics to measure the quality of a solution: optimality gap (MIQP GAP), computation time in seconds (Time), upper bound (UB), lower bound (LB), computational time of the root continuous relaxation (Time LP), and the number of explored nodes in the branch-and-bound tree.

We also evaluate the quality of the estimated DAGs by comparing them with the ground truth. To this end, we use the average structural Hamming distance (SHD), as well as true positive (TPR) and false positive rates (FPR). These criteria evaluate different aspects of the quality of the estimated DAGs: SHD counts the number of differences (additions, deletions, or arc reversals) required to transform the predicted DAG into the true DAG; TPR is the number of correctly identified arcs divided by the total number of true arcs, P; FPR is the number of incorrectly identified arcs divided by the total number of negatives (non-existing arcs), N. For brevity, TPR and FPR plots are presented in Appendix II.
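The three accuracy metrics are simple set computations on arc sets. The sketch below follows the definitions above, with one common convention made explicit: a reversed arc counts as a single SHD difference.

```python
def dag_metrics(true_arcs, est_arcs, m):
    """SHD, TPR, and FPR for an estimated DAG on m nodes.
    A reversed arc counts as one SHD difference; TPR and FPR compare
    directed arcs against the true arc set."""
    true_arcs, est_arcs = set(true_arcs), set(est_arcs)
    reversals = {(j, k) for (j, k) in est_arcs
                 if (j, k) not in true_arcs and (k, j) in true_arcs}
    additions = {a for a in est_arcs - true_arcs if a not in reversals}
    deletions = {(j, k) for (j, k) in true_arcs - est_arcs
                 if (k, j) not in est_arcs}
    shd = len(reversals) + len(additions) + len(deletions)
    P = len(true_arcs)            # number of true arcs
    N = m * (m - 1) - P           # non-existing (ordered) arc slots
    tp = len(true_arcs & est_arcs)
    fp = len(est_arcs - true_arcs)
    return shd, tp / P, fp / N

# True DAG 1->2->3; the estimate reverses one arc and adds a spurious one.
print(dag_metrics([(1, 2), (2, 3)], [(1, 2), (3, 2), (1, 3)], m=3))
```

Here the reversal (3,2) and the spurious arc (1,3) give SHD = 2, TPR = 1/2, and FPR = 2/4.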
5.2 Comparison of ℓ0 formulations

Figure 5 reports the average metrics across 10 random graphs for the ℓ0 formulations with n = 1000. The LO formulation fails to attain a reasonable solution for one graph (out of 10) with m = 40 and λ ∈ {0.1, 1}. This is due to the large computation time for solving its continuous relaxation. We excluded these two instances from the LO results.

Figure 5(a) shows that the LN formulation outperforms the other formulations in terms of the average optimality gap across all numbers of nodes m ∈ {10, 20, 30, 40} and regularization parameters λ ∈ {0.1, 1}. The difference becomes more pronounced for moral instances, for which the number of binary variables and constraints of LN depends solely on the size of the moral graph. Figure 5(b) also indicates that the LN formulation requires the least computational time for small instances, whereas all models hit the time limit for larger instances.

Figures 5(c)-(d) show the performance of all methods in terms of their upper and lower bounds. For easier instances (e.g., complete instances with m ∈ {10, 20} and moral instances), all methods attain almost the same upper bound. Nonetheless, LN performs better in terms of improving the lower bound. For more difficult instances, LN outperforms the other methods in terms of attaining both a smaller upper bound (feasible solution) and a larger lower bound.

Figures 5(e)-(f) show the continuous relaxation times of all models and the number of explored nodes in the branch-and-bound tree, respectively. The fastest computational time for the continuous relaxation is achieved by the TO formulation, followed by the LN formulation. However, the number of explored nodes provides more information about the performance of the mathematical formulations. In small instances, i.e., m = 10, where an optimal solution is attained, the branch-and-bound tree of the LN formulation is smaller than that of the TO formulation.
This is because the TO formulation has a larger number of binary variables, leading to a larger branch-and-bound tree. On the other hand, for large instances, the number of explored nodes for the LN formulation is larger than for the TO formulation. This implies that the LN formulation explores more nodes in the branch-and-bound tree within the given time limit, which may be because the continuous relaxations of the LN formulation are easier to solve than those of the TO formulation in the branch-and-bound process. As stated earlier, the branch-and-bound algorithm needs to explore multiple symmetric solutions of the TO formulation as it branches on fractional topological ordering variables. This degrades the performance of the TO formulation. The LO formulation is very slow because its continuous relaxation becomes cumbersome as the number of nodes m increases. Thus, we see a substantial decrease in the number of explored nodes in the branch-and-bound trees associated with the LO formulation. The CP formulation is implemented in a cutting-plane fashion; hence, its number of explored nodes is not directly comparable with the other formulations.

Figures 5(a)-(f) show the importance of incorporating available structural knowledge (e.g., a moral graph). The average optimality gap and computational time are substantially lower for moral instances compared to complete instances. Moreover, the substantial difference in the optimality gap elucidates the importance of incorporating structural knowledge. Similar results are obtained for n = 100 samples; see Appendix II.

We next discuss the performance of the different methods in terms of estimating the true DAG. The choice of the tuning parameter λ, the number of samples n, and the quality of the best feasible solution (i.e., the upper bound) influence the resulting DAG. Because our focus in this paper is on computational aspects, we fixed the values of λ ∈ {0.1, 1} for a fair comparison between the formulations.
Figures 6(a) and 6(b) report the SHD for n = 1000 and n = 100, respectively. Comparing Figure 6(a) with Figure 6(b), we observe that the SHD tends to increase as the number of samples decreases. As discussed earlier, when n → ∞, the penalized likelihood estimate with ℓ0 regularization ensures identifiability in our setting [33, 41]. However, for a finite sample size, identifiability may not be guaranteed. Moreover, the appropriate choice of λ for n = 100 may differ from the corresponding λ for n = 1000.

Figure 6(a) shows that all methods learn the true DAG with λ = 0.1 given a moral graph for m ∈ {10, 20, 30}. In addition, the SHD is negligible for the LN and CP formulations for m = 40. However, we observe a substantial increase in SHD (e.g., from 0.2 to near 10 for LN) for complete graphs. These figures indicate the importance of incorporating available structural knowledge (e.g., a moral graph) for better estimation of the true DAG.

While, in general, LN performs well compared with the other formulations, we do not expect a clear dominance in terms of accuracy of DAG estimation, due either to finite samples or to the fact that none of the methods could attain a globally optimal solution for larger instances. Indeed, we retrieve the true DAG for smaller graphs, for which optimal solutions are obtained. As pointed out in [29], a slight change in the objective function value can significantly alter the estimated DAG. Our results corroborate this observation.

5.3 Comparison of ℓ1 formulations

Figure 7 shows various average metrics across 10 random graphs for ℓ1 regularization with n = 1000 samples. Figure 7(a) shows that the LN formulation clearly outperforms the other formulations in terms of the average optimality gap across all numbers of nodes m ∈ {10, 20, 30, 40} and regularization parameters λ ∈ {0.1, 1}. Moreover, Figure 7(b) shows that the LN formulation requires significantly less computational time in moral instances, and in complete instances with m ∈ {10, 20}, compared to the other methods. In complete instances, all methods hit the time limit for m ∈ {30, 40}. Figures 7(c)-(f) can be interpreted similarly to Figures 5(c)-(f) for ℓ0 regularization. As with ℓ0 regularization, Figures 7(a)-(b) demonstrate the importance of incorporating structural knowledge (e.g., a moral graph) for ℓ1 regularization. Similar results are observed for n = 100 samples; see Appendix II.

As expected, the DAG estimation accuracy with ℓ1 regularization is inferior to that with ℓ0 regularization.
This is in part due to the bias associated with ℓ1 regularization, which could be further controlled with, for example, adaptive ℓ1-norm regularization [47]. Nonetheless, formulations with ℓ1 regularization require less computational time and are easier to solve than the corresponding formulations with ℓ0 regularization.

5.4 Comparison with the A*-lasso algorithm

In this section, we compare the LN formulation with A*-lasso [42], using the MATLAB code made available by the authors. For this comparison, the true DAG structures are taken from [42], and the arc strengths (β) are chosen from U[−1, −0.1] ∪ U[0.1, 1]. The number of nodes ranges from m = 6 to m = 27 (see Table 2). The true DAG and the resulting random β coefficients are used to generate n = 500 samples for each column of the data matrix X.

A time limit of six hours is imposed across all experiments, after which runs are aborted. In addition, for a fairer comparison with A*-lasso, we do not impose the MIQP gap termination criterion of 0.001 for LN and instead use the Gurobi default optimality gap criterion of 0.0001.
Figure 5: Optimization-based measures for the MIQPs with ℓ0 regularization and n = 1000 samples, for moral and complete instances, λ ∈ {0.1, 1.0}, and m ∈ {10, 20, 30, 40} nodes. Panels: (a) optimality gaps for the MIQPs; (b) time (in seconds) for the MIQPs; (c) best upper bounds; (d) best lower bounds; (e) time (in seconds) for the continuous root relaxation; (f) number of explored nodes in the B&B tree. Legend: LN, TO, CP, LO.
Figure 6: Structural Hamming Distance (SHD) of MIQP estimates with ℓ0 regularization, for (a) n = 1000 and (b) n = 100.

Similar to the synthetic data described in Section 5.1, we consider two cases: (i) moral instances and (ii) complete instances. For the former, the moral graph is constructed from the true DAG as done in Section 5.1. The raw observational data (i.e., X) for moral and complete instances are the same.

We compare the LN formulation with A*-lasso using ℓ1 regularization. Note that A*-lasso cannot solve the model with ℓ0 regularization. Furthermore, the original A*-lasso algorithm assumes no super-structure. Therefore, to enhance its performance, we modified the MATLAB code for A*-lasso in order to incorporate the moral graph structure, when available. We consider two values, λ ∈ {0, 0.1}, for our comparison. As λ decreases, identifying an optimal DAG becomes more difficult. Thus, it is of interest to evaluate the computational performance of both methods for λ = 0 (i.e., no regularization) to assess their performance on difficult cases (see, e.g., the computational results in [46] and the statistical analysis in [25] for λ = 0). We note that model selection methods (such as the Bayesian information criterion [38]) can be used to identify the best value of λ. However, in this section, our focus is on evaluating the computational performance of these approaches for a given λ.

Table 2 shows the solution times (in seconds) of A*-lasso versus the LN formulation for complete and moral instances. For the LN formulation, if the algorithm cannot prove optimality within the 6-hour time limit, we stop the algorithm and report, in parentheses, the optimality gap at termination. For complete instances with λ = 0, the results highlight that for small instances (up to 14 nodes) the A*-algorithm performs better, whereas the LN formulation outperforms A*-lasso for larger instances.
In particular, the LN formulation attains the optimal solution for the Cloud data set in 810.47 seconds, and it obtains feasible solutions that are provably within 99.5% of the optimal objective value for the Funnel and Galaxy data sets. For moral instances, we observe a significant improvement in the computational performance of the LN formulation, whereas the improvement for A*-lasso is marginal in comparison. This observation highlights the fact that dynamic programming-based approaches cannot effectively utilize super-structure knowledge, whereas an IP-based approach, particularly the LN formulation, can reduce computational times significantly. For instance, LN's computational time for the Cloud data set reduces from roughly 810 seconds to less than two seconds when the moral graph is provided.
Figure 7: Optimization-based measures for the MIQPs with ℓ1 regularization and n = 1000 samples, for moral and complete instances, λ ∈ {0.1, 1.0}, and m ∈ {10, 20, 30, 40} nodes. Panels: (a) optimality gaps for the MIQPs; (b) time (in seconds) for the MIQPs; (c) best upper bounds; (d) best lower bounds; (e) time (in seconds) for the continuous root relaxation; (f) number of explored nodes in the B&B tree. Legend: LN, TO, CP, LO.
Figure 8: Structural Hamming Distance (SHD) of MIQP estimates with ℓ1 regularization, for (a) n = 1000 and (b) n = 100 samples.

In contrast, the reduction in the computational time of the A*-algorithm with the moral graph is negligible. For λ = 0.1, the problem is easier to solve. Nevertheless, LN performs well in comparison to A*-lasso, and the performance of the LN formulation improves further for moral instances.

For DAG learning from discrete data, an IP-based model (see, e.g., [19]) outperforms A* algorithms when a cardinality constraint on the parent set of each node is imposed; A* tends to perform better if such constraints are not enforced. This is mainly because an IP-based model for discrete data requires an exponential number of variables, which becomes cumbersome if such cardinality constraints are not permitted. In contrast, for DAG learning from continuous data with linear SEMs, our results show that an IP-based approach does not have this limitation, because the variables are encoded in the space of arcs (instead of parent sets). This is why LN performs well even for complete instances (i.e., with no restriction on the cardinality of the parent sets).

There are several fundamental advantages of IP-based modeling, particularly the LN formulation, compared to A*-lasso: (i) the variables in the IP-based models (i.e., TO, LN, and CP) depend on the super-structure. Therefore, these models can effectively utilize prior knowledge to reduce the search space, whereas A*-lasso cannot utilize super-structure information as effectively. This is particularly important for the LN formulation, as its numbers of variables and constraints depend only on the super-structure; (ii) all IP-based models can incorporate both ℓ0 and ℓ1 regularizations, whereas A*-lasso can solve the problem only with ℓ1 regularization; (iii) IP-based methods in general enjoy the versatility to incorporate a wide variety of structural constraints, whereas A*-lasso and dynamic programming approaches cannot accommodate many structural assumptions.
For instance, a modeler may prefer restricting the number of arcs in the DAG; this is achievable by imposing a single constraint in an IP-based model, whereas such structural knowledge cannot be imposed on the A*-lasso algorithm; (iv) A*-lasso is based on dynamic programming; therefore, one cannot interrupt the search early to obtain a feasible solution. In contrast, one can impose a time limit or an optimality-gap tolerance to stop the search in a branch-and-bound tree. The output is then a feasible solution to the problem, which provides an upper bound, as well as a lower bound that certifies the quality of the feasible solution; (v) algorithmic advances in integer optimization alone (such as faster continuous relaxation solutions, heuristics for better upper bounds, and cutting planes for better lower bounds) have resulted in a 29,000-factor speedup in solving IPs [4] using a branch-and-bound process. Many of these advances have been implemented in powerful state-of-the-art optimization solvers (e.g., Gurobi), but they cannot be used in dynamic programming methods such as A*-lasso.

Figure 9: The progress of the upper bound versus the lower bound in the branch-and-bound tree for the LN formulation with no regularization (i.e., λ = 0), on the Factors dataset with (a) the complete graph and (b) the moral graph.

Figures 9(a)-(b) illustrate the progress of the upper bound versus the lower bound in the branch-and-bound process for the Factors dataset and highlight an important practical implication of an IP-based model: such models often attain high-quality upper bounds (i.e., feasible solutions) in a short amount of time, whereas the rest of the time is spent closing the optimality gap by increasing the lower bound.

The results for the moral graph in Figure 9(b) highlight another important observation: providing the moral graph can significantly accelerate the progress of the lower bound in the branch-and-bound process. In other words, information from the moral graph helps close the optimality gap more quickly.

Table 2: Computational performance of LN versus the A*-algorithm with ℓ1 regularization for λ ∈ {0, 0.1}. Times are in seconds; entries in parentheses are the optimality gaps at the six-hour time limit.
                       Moral, λ = 0         Complete, λ = 0      Moral, λ = 0.1       Complete, λ = 0.1
Dataset         m |M|  A*-lasso   LN        A*-lasso  LN         A*-lasso  LN        A*-lasso  LN
dsep            6   16  0.017     0.378     0.0261    0.388      0.455     0.025     0.429     0.108
Asia            8   40  0.016     0.401     0.152     0.887      0.195     0.071     0.191     0.319
Bowling         9   36  0.018     0.554     0.467     1.31       0.417     0.225     0.489     0.291
Insurancesmall 15   76  391       2.719     547.171   613.543    2.694     1.135     3.048     0.531
Rain           14   70  119.15    1.752     101.33    246.25     51.737    0.632     69.404    3.502
Cloud          16   58  4433.29   1.421     18839     810.471    1066.08   0.426     2230.035  7.249
Funnel         18   62  6 hrs     1.291     6 hrs     6 hrs (.002)  6 hrs  0.395     6 hrs     3.478
Galaxy         20   76  6 hrs     1.739     6 hrs     6 hrs (.005)  6 hrs  0.740     6 hrs     9.615
Insurance      27  168  6 hrs     131.741   6 hrs     6 hrs (.162)  6 hrs  12.120    6 hrs     6 hrs (.031)
Factors        27  310  6 hrs     6 hrs (.001)  6 hrs 6 hrs (.081)  6 hrs  55.961    6 hrs     6 hrs (.01)

Conclusion
In this paper, we study the problem of learning an optimal DAG from continuous observational data using a score function, where the causal effects among the random variables are linear. We cast the problem as a mathematical program and use a penalized negative log-likelihood score function with both ℓ0 and ℓ1 regularizations. The mathematical programming framework can naturally incorporate a wide range of structural assumptions. For instance, it can incorporate a super-structure (e.g., a skeleton or moral graph) in the form of an undirected and possibly cyclic graph; such super-structures can be estimated from observational data. We review three mathematical formulations, cutting plane (CP), topological ordering (TO), and linear ordering (LO), and propose a new mixed-integer quadratic optimization (MIQO) formulation, referred to as the layered network (LN) formulation. We establish that the continuous relaxations of all models attain the same optimal objective function value under a mild condition. Moreover, the LN formulation is compact in contrast to CP, its relaxation can be solved much more efficiently than that of LO, and it requires fewer binary variables and explores fewer branch-and-bound nodes than TO.

Our numerical experiments indicate that these advantages result in considerable improvement in the performance of LN compared to the other MIQP formulations (CP, LO, and TO). The improvements are particularly pronounced when a sparse super-structure is available, because LN is the only formulation in which the number of constraints and binary variables depends solely on the super-structure. Our numerical experiments also demonstrate that the LN formulation has a number of advantages over the A*-lasso algorithm, especially when a sparse super-structure is available.

At least two future research avenues are worth exploring.
First, one of the difficulties of estimating DAGs using mathematical programming techniques is the constraints in (6b). The big-M constraint is often very loose, which slows the convergence of the branch-and-bound process. It is of interest to study these constraints in order to improve the lower bounds obtained from continuous relaxations. Second, in many real-world applications, the underlying DAG has special structure. For instance, the true DAG may be a polytree [12]. Another example is a hierarchical structure, in which case it is natural to learn a DAG that satisfies the hierarchy among different groups of random variables. This problem has important applications in discovering the genetic basis of complex diseases (e.g., asthma, diabetes, atherosclerosis).

References

[1] Bryon Aragam and Qing Zhou. Concave penalized estimation of sparse Gaussian Bayesian networks.
Journal of Machine Learning Research , 16:2273–2328, 2015.[2] Mark Bartlett and James Cussens. Advances in Bayesian network learning using integer pro-gramming. arXiv preprint arXiv:1309.6825 , 2013.[3] Mark Bartlett and James Cussens. Integer linear programming for the Bayesian networkstructure learning problem.
Artificial Intelligence , 244:258–271, 2017.[4] Dimitris Bertsimas, Angela King, Rahul Mazumder, et al. Best subset selection via a modernoptimization lens.
The annals of statistics , 44(2):813–852, 2016.235] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In
Learning fromdata , pages 121–130. Springer, 1996.[6] David Maxwell Chickering. Optimal structure identification with greedy search.
Journal ofmachine learning research , 3(Nov):507–554, 2002.[7] James Cussens. Maximum likelihood pedigree reconstruction using integer programming. In
WCB@ ICLP , pages 8–19, 2010.[8] James Cussens. Bayesian network learning with cutting planes. arXiv preprintarXiv:1202.3713 , 2012.[9] James Cussens, David Haws, and Milan Studen`y. Polyhedral aspects of score equivalence inBayesian network structure learning.
Mathematical Programming , 164(1-2):285–324, 2017.[10] James Cussens, Matti J¨arvisalo, Janne H Korhonen, and Mark Bartlett. Bayesian networkstructure learning with integer programming: Polytopes, facets and complexity.
J. Artif.Intell. Res.(JAIR) , 58:185–229, 2017.[11] George Dantzig, Ray Fulkerson, and Selmer Johnson. Solution of a large-scale traveling-salesman problem.
Journal of the operations research society of America , 2(4):393–410, 1954.[12] Sanjoy Dasgupta. Learning polytrees. In
Proceedings of the Fifteenth conference on Uncertaintyin artificial intelligence , pages 134–141. Morgan Kaufmann Publishers Inc., 1999.[13] Daniel Eaton and Kevin Murphy. Exact Bayesian structure learning from uncertain interven-tions. In
Artificial Intelligence and Statistics , pages 107–114, 2007.[14] Fei Fu and Qing Zhou. Learning sparse causal Gaussian networks with experimental interven-tion: regularization and coordinate descent.
Journal of the American Statistical Association ,108(501):288–300, 2013.[15] Asish Ghoshal and Jean Honorio. Information-theoretic limits of Bayesian network structurelearning. In Aarti Singh and Jerry Zhu, editors,
Proceedings of the 20th International Con-ference on Artificial Intelligence and Statistics , volume 54 of
Proceedings of Machine LearningResearch , pages 767–775, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.[16] Martin Gr¨otschel, Michael J¨unger, and Gerhard Reinelt. On the acyclic subgraph polytope.
Mathematical Programming , 33(1):28–42, 1985.[17] Sung Won Han, Gong Chen, Myun-Seok Cheon, and Hua Zhong. Estimation of directed acyclicgraphs through two-stage adaptive lasso for gene network inference.
Journal of the AmericanStatistical Association , 111(515):1004–1019, 2016.[18] Patrick Healy and Nikola S Nikolov. A branch-and-cut approach to the directed acyclic graphlayering problem. In
International Symposium on Graph Drawing , pages 98–109. Springer,2002.[19] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning Bayesian net-work structure using LP relaxations. In
Proceedings of the Thirteenth International Conferenceon Artificial Intelligence and Statistics , pages 358–365, 2010.2420] Markus Kalisch and Peter B¨uhlmann. Estimating high-dimensional directed acyclic graphswith the PC-algorithm.
Journal of Machine Learning Research , 8(Mar):613–636, 2007.[21] Mikko Koivisto. Advances in exact Bayesian structure discovery in Bayesian networks. arXivpreprint arXiv:1206.6828 , 2012.[22] Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks.
Journal of Machine Learning Research , 5(May):549–573, 2004.[23] Daphne Koller and Nir Friedman.
Probabilistic graphical models: principles and techniques .MIT press, 2009.[24] Steffen L. Lauritzen.
Graphical Models . Oxford University Press, 1996.[25] Po-Ling Loh and Peter B¨uhlmann. High-dimensional learning of linear causal networks viainverse covariance estimation.
The Journal of Machine Learning Research , 15(1):3065–3105,2014.[26] Brandon Malone, Kustaa Kangas, Matti J¨arvisalo, Mikko Koivisto, and Petri Myllym¨aki.Predicting the hardness of learning Bayesian networks. In
AAAI , pages 2460–2466, 2014.[27] Florian Markowetz and Rainer Spang. Inferring cellular networks–a review.
BMC bioinfor-matics , 8(6):S5, 2007.[28] George L Nemhauser and Laurence A Wolsey. Integer programming and combinatorial opti-mization.
Wiley, Chichester. GL Nemhauser, MWP Savelsbergh, GS Sigismondi (1992). Con-straint Classification for Mixed Integer Programming Formulations. COAL Bulletin , 20:8–12,1988.[29] Young Woong Park and Diego Klabjan. Bayesian network learning via topological order.
TheJournal of Machine Learning Research , 18(1):3451–3482, 2017.[30] Pekka Parviainen and Mikko Koivisto. Exact structure discovery in Bayesian networks with lessspace. In
Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence ,pages 436–443. AUAI Press, 2009.[31] Judea Pearl et al. Causal inference in statistics: An overview.
Statistics surveys , 3:96–146,2009.[32] Eric Perrier, Seiya Imoto, and Satoru Miyano. Finding optimal Bayesian network given asuper-structure.
Journal of Machine Learning Research , 9(Oct):2251–2286, 2008.[33] Jonas Peters and Peter B¨uhlmann. Identifiability of Gaussian structural equation models withequal error variances.
Biometrika , 101(1):219–228, 2013.[34] Garvesh Raskutti and Caroline Uhler. Learning directed acyclic graphs based on sparsestpermutations. arXiv preprint arXiv:1307.0366 , 2013.[35] Ali Shojaie and George Michailidis. Penalized likelihood methods for estimation of sparsehigh-dimensional directed acyclic graphs.
Biometrika , 97(3):519–538, 2010.2536] Tomi Silander and Petri Myllymaki. A simple approach for finding the globally optimalBayesian network structure. arXiv preprint arXiv:1206.6875 , 2012.[37] Liam Solus, Yuhao Wang, Lenka Matejovicova, and Caroline Uhler. Consistency guaranteesfor permutation-based causal inference algorithms. arXiv preprint arXiv:1702.03530 , 2017.[38] A Sondhi and A Shojaei. The reduced PC-algorithm: Improved causal structure learning inlarge random networks. arXiv preprint arXiv:1806.06209 , 2018.[39] Peter Spirtes, Clark N Glymour, and Richard Scheines.
Causation, prediction, and search .MIT press, 2000.[40] Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the RoyalStatistical Society. Series B (Methodological) , pages 267–288, 1996.[41] Sara van de Geer and Peter B¨uhlmann. (cid:96) -penalized maximum likelihood for sparse directedacyclic graphs. The Annals of Statistics , 41(2):536–567, 2013.[42] Jing Xiang and Seyoung Kim. A* lasso for learning a sparse Bayesian network structure forcontinuous variables. In
Advances in neural information processing systems , pages 2418–2426,2013.[43] Changhe Yuan and Brandon Malone. Learning optimal Bayesian networks: A shortest pathperspective.
Journal of Artificial Intelligence Research , 48:23–65, 2013.[44] Changhe Yuan, Brandon Malone, and Xiaojian Wu. Learning optimal Bayesian networksusing A* search. In
IJCAI proceedings-international joint conference on artificial intelligence ,volume 22, page 2186, 2011.[45] Bin Zhang, Chris Gaiteri, Liviu-Gabriel Bodea, Zhi Wang, Joshua McElwee, Alexei APodtelezhnikov, Chunsheng Zhang, Tao Xie, Linh Tran, Radu Dobrin, et al. Integrated sys-tems approach identifies genetic nodes and networks in late-onset Alzheimer’s disease.
Cell ,153(3):707–720, 2013.[46] Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P Xing. DAGs with NO TEARS:smooth optimization for structure learning. arXiv preprint arXiv:1803.01422 , 2018.[47] Hui Zou. The adaptive lasso and its oracle properties.
Journal of the American statisticalassociation , 101(476):1418–1429, 2006. 26 ppendix I
PROOF OF PROPOSITION 1
Let $(\hat\beta, \hat z)$ be an optimal solution of (5) with optimal objective value $F(\hat\beta)$, and let DAG$(V, \hat E)$ denote the DAG structure corresponding to this optimal solution. Suppose that $\hat z_{jk} + \hat z_{kj} = 0$ for some pair $(j,k)$. To prove the proposition, we construct an optimal solution that satisfies $z_{jk} + z_{kj} = 1$ for all pairs $(j,k)$ and meets the following conditions: (i) the corresponding graph (a tournament) is cycle free; (ii) the tournament attains the same objective value, $F(\hat\beta)$, as the optimal DAG.

Select a pair of nodes, say $p, q \in V$, $p \ne q$, from DAG$(V, \hat E)$ for which $\hat z_{pq} + \hat z_{qp} = 0$. If there is a directed path from $p$ to $q$ (respectively, from $q$ to $p$), then we add the arc $(p,q)$ (respectively, $(q,p)$); this arc does not create a cycle in the graph. If there is no directed path between $p$ and $q$, we may add an arc in either direction. In all cases, we set the $\beta$ value corresponding to the added arc to zero. We repeat this process for all pairs of nodes with $\hat z_{pq} + \hat z_{qp} = 0$; the pairs can be chosen arbitrarily.

Such arcs can always be added without creating a cycle: if an arc could be added in neither direction, there would be both a directed path from $p$ to $q$ and a directed path from $q$ to $p$ in the current graph, a contradiction because it implies a directed cycle in an optimal DAG. Since each step maintains a DAG, condition (i) follows by induction. Since the $\beta$ values of the added arcs are zero, the objective value does not change, which establishes condition (ii) and completes the proof. □

PROOF OF PROPOSITION 2
First, we prove that (10e) removes all cycles. Suppose, for contradiction, that there is a cycle on nodes $1, 2, \ldots, p$ for some $p \ge 2$, i.e., $z_{j,j+1} = 1$ and $z_{j+1,j} = 0$ for all $j \in \{1, \ldots, p-1\}$, and $z_{p,1} = 1$, $z_{1,p} = 0$. Then, by (10e),

$1 = z_{1,2} - (m-1)\, z_{2,1} \le \psi_2 - \psi_1,$
$1 = z_{2,3} - (m-1)\, z_{3,2} \le \psi_3 - \psi_2,$
$\vdots$
$1 = z_{p-1,p} - (m-1)\, z_{p,p-1} \le \psi_p - \psi_{p-1},$
$1 = z_{p,1} - (m-1)\, z_{1,p} \le \psi_1 - \psi_p.$

Summing the above inequalities, the right-hand sides telescope to zero, and we conclude $p \le 0$, a contradiction.

To complete the proof, we also need to show that any DAG is feasible for the LN formulation. Every DAG admits a topological ordering. For all existing arcs $(j,k)$ of the DAG, set $z_{jk} = 1$, and assign to each variable $\psi_k$, $k \in V$, the topological order of node $k$. Then the set of constraints (10e) is always satisfied. □

PROOF OF PROPOSITION 3
The LO formulation is in $w$-space, whereas LN is in $(z, \psi)$-space; hence, we first construct a mapping between these decision variables. Given a feasible solution $w_{jk}$, $j, k \in V$, $j \ne k$, of the LO formulation, we define $z_{jk} = w_{jk}$ for $(j,k) \in \vec E$ and $\psi_j = \sum_{\ell \in V \setminus \{j\}} w_{\ell j}$ for $j \in V$. Let $\mathbb{1}(x \ge 0)$ denote the function that takes value 1 if $x \ge 0$ and 0 otherwise. Conversely, given $z_{jk}$ for all $(j,k) \in \vec E$ and $\psi_j$ for $j \in V$ in the LN formulation, we map $w_{jk} = z_{jk}$ for all $(j,k) \in \vec E$ and $w_{jk} = \mathbb{1}(\psi_k - \psi_j + 1 \ge 0)$ for all $(j,k) \notin \vec E$. Note that the $w$-space is defined for all pairs of nodes, whereas the $z$-space is defined only for the set of arcs in $\vec E$. In every such mapping we have $w_{jk} = z_{jk}$ for all $(j,k) \in \vec E$.

Fixing $j$ and $k$ for each $(j,k) \in \vec E$ and summing the left-hand side of inequalities (8e) over $i \in V \setminus \{j,k\}$, we obtain

$(m-2)\, w_{jk} + \sum_{i \in V \setminus \{j,k\}} w_{ki} + \sum_{i \in V \setminus \{j,k\}} w_{ij} \le 2(m-2)$
$\equiv (m-2)\, w_{jk} + \sum_{i \in V \setminus \{j,k\}} w_{ki} + \sum_{i \in V \setminus \{j,k\}} (1 - w_{ji}) \le 2(m-2)$
$\equiv m\, w_{jk} + \sum_{i \in V \setminus \{k\}} w_{ki} - \sum_{i \in V \setminus \{j\}} w_{ji} \le m - 1$
$\equiv m\, w_{jk} - m + 1 \le \sum_{i \in V \setminus \{k\}} w_{ik} - \sum_{i \in V \setminus \{j\}} w_{ij},$

where the equivalences follow from constraints (8d). Given our mapping, $z_{jk} = w_{jk}$ for all $(j,k) \in \vec E$ and $\psi_j = \sum_{\ell \in V \setminus \{j\}} w_{\ell j}$ for $j \in V$, the above set of constraints can be written as

$z_{jk} - (m-1)\, z_{kj} \le \psi_k - \psi_j \quad \forall (j,k) \in \vec E,$

which is (10e) in the LN formulation. This implies $R(LO) \subseteq R(LN)$ for ℓ0 regularization. For ℓ1 regularization, we additionally have $g_{jk} \le w_{jk} = z_{jk}$ for all $(j,k) \in \vec E$; this implies $R(LO) \subseteq R(LN)$ for ℓ1 regularization as well.

To show strict containment, we give a point that is feasible for the LN formulation but cannot be mapped to any feasible point of the LO formulation. Consider $m = 3$, $z_{12} = 1$, $z_{21} = 0$, $z_{23} = 0.5 + \epsilon$, $z_{32} = 0.5 - \epsilon$, $z_{13} = z_{31} = 0.5$, $0 < \epsilon < 0.5$, with an appropriate choice of $\psi$ and $\beta$. It can be checked that this is a feasible solution to the LN formulation. Because we must have $w_{ij} = z_{ij}$ for all $(i,j) \in \vec E$, we obtain $w_{12} + w_{23} + w_{31} > 2$. Therefore, the corresponding point is infeasible for the LO formulation, and this completes the proof. □
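The containment direction of this proof can be sanity-checked numerically. The sketch below (hypothetical helper names, not code from the paper) builds integral LO points from permutations, forms a fractional LO point as a convex combination of them, applies the mapping $z_{jk} = w_{jk}$, $\psi_j = \sum_{\ell \ne j} w_{\ell j}$, and verifies the resulting constraint $z_{jk} - (m-1) z_{kj} \le \psi_k - \psi_j$; it assumes (10e) has exactly this form and that the super-structure is complete:

```python
import random

def lo_point_from_permutation(perm):
    """Linear-ordering point: w[(j, k)] = 1 iff j precedes k in perm."""
    pos = {v: i for i, v in enumerate(perm)}
    m = len(perm)
    return {(j, k): float(pos[j] < pos[k])
            for j in range(m) for k in range(m) if j != k}

def ln_point(w, m):
    """The mapping used in the proof: z_jk = w_jk, psi_j = sum_l w_lj."""
    psi = [sum(w[(l, j)] for l in range(m) if l != j) for j in range(m)]
    return dict(w), psi

def satisfies_10e(z, psi, m, tol=1e-9):
    """Check z_jk - (m - 1) z_kj <= psi_k - psi_j for every ordered pair."""
    return all(z[(j, k)] - (m - 1) * z[(k, j)] <= psi[k] - psi[j] + tol
               for j in range(m) for k in range(m) if j != k)

if __name__ == "__main__":
    m = 5
    random.seed(0)
    # Integral LO points (permutations) ...
    points = [lo_point_from_permutation(random.sample(range(m), m))
              for _ in range(3)]
    # ... and a fractional LO point: (8d)-(8e) are linear constraints,
    # so a convex combination of feasible points stays LO-feasible.
    lam = (0.2, 0.3, 0.5)
    frac = {key: sum(l * p[key] for l, p in zip(lam, points))
            for key in points[0]}
    for w in points + [frac]:
        z, psi = ln_point(w, m)
        assert satisfies_10e(z, psi, m)
    print("all mapped LO points satisfy (10e)")
```

Permutations are exactly the integral solutions of (8d)-(8e), so both the integral and the fractional cases exercise the containment $R(LO) \subseteq R(LN)$ established above.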
PROOF OF PROPOSITION 4
This proof treats the TO formulation with the parameter $m$ in (9e) replaced by $m - 1$. In (9e), substitute the term $\sum_{s \in V} s\, o_{ks}$ by $\psi_k$ and the term $\sum_{s \in V} s\, o_{js}$ by $\psi_j$; further, remove the sets of constraints (9f), (9g), and (9i). This implies that $R(TO) \subseteq R(LN)$. To see strict containment, consider the point described in the proof of Proposition 3, which is feasible for the LN formulation. For this point, there can be no feasible assignment of the decision matrix $o$ such that $\psi_j = \sum_{s \in V} s\, o_{js}$; hence $R(TO) \subset R(LN)$. □

PROOF OF PROPOSITION 5
Let $\bar F(\beta^\star_X)$ denote the optimal objective value associated with the continuous relaxation of model $X \in \{CP, LO, TO, LN\}$.

Part A. $\bar F(\beta^\star_{LO}) = \bar F(\beta^\star_{LN})$.

Case 1: ℓ0 regularization. Suppose $(\beta^\star_{LN}, z^\star)$ is an optimal solution of the continuous relaxation of the LN formulation (10a)-(10g), and $(\beta^\star_{LO}, w^\star)$ is an optimal solution of the continuous relaxation of the LO formulation with ℓ0 regularization. Given Proposition 3, we conclude that $\bar F(\beta^\star_{LN}) \le \bar F(\beta^\star_{LO})$. We prove that $\bar F(\beta^\star_{LN}) \ge \bar F(\beta^\star_{LO})$ also holds at an optimal solution. To this end, we map an optimal solution of the continuous relaxation of the ℓ0-LN formulation to a feasible solution of the continuous relaxation of the ℓ0-LO formulation with the same objective value.

Given an optimal solution $(\beta^\star_{LN}, z^\star)$, we construct a feasible solution $(\beta_{LO}, w)$ of the continuous relaxation of the LO formulation by setting $w_{jk} = w_{kj} = \tfrac12$ for every pair $j \ne k$ (in particular for each $(j,k) \in \vec E$ with $z^\star_{jk} > 0$), and let $\beta_{LO} = \beta^\star_{LN}$.

We now show that this mapping is always valid for the ℓ0-LO formulation. Recall that for ℓ0 regularization, the decision vector $g$ does not appear in the formulation. Three sets of constraints have to be satisfied in the ℓ0-LO formulation:

$|\beta_{jk}| \le M w_{jk} \quad \forall (j,k) \in \vec E,$ (8b)
$w_{jk} + w_{kj} = 1 \quad \forall (j,k) \in \vec E,$ (8d)
$w_{ij} + w_{jk} + w_{ki} \le 2 \quad \forall (i,j), (j,k), (k,i) \in \vec E,\ i \ne j \ne k.$ (8e)

The set of constraints (8b) is satisfied because we set $M \ge 2 \max_{(j,k) \in \vec E} |\beta^\star_{jk}|$. The set of constraints (8d) is trivially satisfied. The set of constraints (8e) is satisfied because, under this mapping, its left-hand side can take value at most $\tfrac32$. Therefore, $\bar F(\beta^\star_{LO}) \le \bar F(\beta_{LO}) = \bar F(\beta^\star_{LN})$. This completes this part of the proof.

Case 2: ℓ1 regularization. Suppose $(\beta^\star_{LN}, g^\star_{LN}, z^\star)$ is an optimal solution of the continuous relaxation of the ℓ1-LN formulation, and $(\beta^\star_{LO}, g^\star_{LO}, w^\star)$ is an optimal solution of the continuous relaxation of the ℓ1-LO formulation. Given Proposition 3, $\bar F(\beta^\star_{LN}, g^\star_{LN}) \le \bar F(\beta^\star_{LO}, g^\star_{LO})$. We now prove that $\bar F(\beta^\star_{LN}, g^\star_{LN}) \ge \bar F(\beta^\star_{LO}, g^\star_{LO})$ also holds at an optimal solution. Given an optimal solution $(\beta^\star_{LN}, g^\star_{LN}, z^\star)$, we construct a feasible solution $(\beta_{LO}, g_{LO}, w)$ of the continuous relaxation of the ℓ1-LO formulation by setting $w_{jk} = w_{kj} = \tfrac12$ for every pair $j \ne k$, and let $\beta_{LO} = \beta^\star_{LN}$ and $g_{LO} = g^\star_{LN}$.

We now show that this mapping is always valid for the LO formulation. Four sets of constraints have to be satisfied in the ℓ1-LO formulation:

$|\beta_{jk}| \le M g_{jk} \quad \forall (j,k) \in \vec E,$ (8b)
$g_{jk} \le w_{jk} \quad \forall (j,k) \in \vec E,$ (8c)
$w_{jk} + w_{kj} = 1 \quad \forall (j,k) \in \vec E,$ (8d)
$w_{ij} + w_{jk} + w_{ki} \le 2 \quad \forall i, j, k \in V,\ i \ne j \ne k.$ (8e)

The set of constraints (8d) is satisfied as in the ℓ0 case. The proof that constraints (8b), (8c), and (8e) are also met is more involved. If the ℓ1-LN formulation attained a solution for which $g^\star_{ij} = g^\star_{jk} = g^\star_{ki} = 1$ (we drop the subscript LN), then our mapping would lead to an infeasible solution of the ℓ1-LO formulation, because (8c) would force $w_{ij} + w_{jk} + w_{ki} \ge 3$. Next we show that this is not the case and that our mapping is valid. To this end, we show that in an optimal solution of the continuous relaxation of ℓ1-LN, we always have $g^\star_{jk} \le \tfrac12$ for all $(j,k) \in \vec E$. Note that in the LN formulation we have $|\beta_{jk}| \le M g_{jk}$ for all $(j,k) \in \vec E$ and $g_{jk} \le z_{jk}$ (dropping the subscript LN). Suppose $g^\star_{jk} > \tfrac12$. Then the objective can reduce $g^\star_{jk}$ to $\tfrac12$, decreasing the regularization term without any increase in the loss function: the constraint $|\beta_{jk}| \le M g_{jk}$ with $g_{jk} = \tfrac12$ imposes no restriction on $\beta$, because $M \ge 2 \max_{(j,k) \in \vec E} |\beta^\star_{jk}|$. Therefore, our mapping is valid and implies that $\bar F(\beta^\star_{LO}, g^\star_{LO}) \le \bar F(\beta_{LO}, g_{LO}) = \bar F(\beta^\star_{LN}, g^\star_{LN})$.

Part B. $\bar F(\beta^\star_{TO}) = \bar F(\beta^\star_{LN})$.

Case 1: ℓ0 regularization. Given Proposition 4, we conclude that $\bar F(\beta^\star_{LN}) \le \bar F(\beta^\star_{TO})$. We now prove that $\bar F(\beta^\star_{LN}) \ge \bar F(\beta^\star_{TO})$ also holds at an optimal solution, by mapping an optimal solution of the continuous relaxation of the ℓ0-LN formulation to a feasible solution of the continuous relaxation of the ℓ0-TO formulation with the same objective value.

We construct a feasible solution $(\beta_{TO}, z_{TO}, o)$ to the TO formulation as follows. Given an optimal solution $(\beta^\star_{LN}, z^\star_{LN}, \psi^\star)$ of the continuous relaxation of the LN formulation, rank the values $\psi^\star_j$ in non-descending order; ties between $\psi$ values can be broken arbitrarily in this ranking. Then, for each node $j \in \{1, \ldots, m\}$, set $o_{jr} = 1$, where $r$ denotes the rank of $\psi^\star_j$ in the non-descending order (the first element is ranked 0). This mapping satisfies the two assignment constraints (9f)-(9g). Let $\beta_{TO} = \beta^\star_{LN}$ and $z_{TO} = z^\star_{LN}$. This gives a feasible solution for the TO formulation; thus, $\bar F(\beta^\star_{LN}) \ge \bar F(\beta^\star_{TO})$.

Case 2: ℓ1 regularization. Define $g_{TO} = g^\star_{LN}$. The rest of the proof is similar to the previous proofs.
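The ranking step in Part B can be sketched in a few lines (a toy illustration with a hypothetical function name; (9f)-(9g) are the assignment constraints referenced above): given relaxed values $\psi^\star$, rank them in non-descending order, break ties arbitrarily, and set $o_{jr} = 1$ for the node $j$ of rank $r$:

```python
def topological_assignment(psi):
    """Rank psi in non-descending order (ties broken arbitrarily, here by
    index) and return the 0/1 matrix o with o[j][r] = 1 iff node j
    receives rank r (the first element is ranked 0)."""
    m = len(psi)
    order = sorted(range(m), key=lambda j: psi[j])
    o = [[0] * m for _ in range(m)]
    for r, j in enumerate(order):
        o[j][r] = 1
    return o

if __name__ == "__main__":
    # A fractional psi* from a relaxed LN solution, including a tie.
    psi_star = [0.4, 2.0, 1.1, 1.1]
    o = topological_assignment(psi_star)
    # Assignment constraints (9f)-(9g): every node gets exactly one rank,
    # and every rank is given to exactly one node.
    assert all(sum(row) == 1 for row in o)
    assert all(sum(o[j][r] for j in range(len(o))) == 1 for r in range(len(o)))
```

Because any real vector $\psi^\star$ admits such a ranking, the construction always produces an assignment matrix $o$ that is feasible for (9f)-(9g), which is the point used in the proof.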
Part C. $\bar F(\beta^\star_{CP}) = \bar F(\beta^\star_{LN})$.

To prove this, we first prove the following lemma.

Lemma. The LO formulation is at least as strong as the CP formulation; that is, $R(LO) \subseteq R(CP)$.

Proof. Consider (8d)-(8e) in the linear ordering formulation, given by

$w_{jk} + w_{kj} = 1 \quad \forall (j,k) \in \vec E,$
$w_{ij} + w_{jk} + w_{ki} \le 2 \quad \forall (i,j), (j,k), (k,i) \in \vec E,$

and the cycle constraints (6c) of the CP formulation,

$\sum_{(j,k) \in C_A} g_{jk} \le |C_A| - 1 \quad \forall C_A \in C.$

Consider the same mapping as in Proposition 3, and select an arbitrary constraint from (6c). Without loss of generality, consider $g_{1,2} + g_{2,3} + \cdots + g_{p-1,p} + g_{p,1} \le p - 1$. We arrange the terms in (8d)-(8e) as

$w_{1,2} + w_{2,3} + w_{3,1} \le 2$
$w_{1,3} + w_{3,4} + w_{4,1} \le 2$
$\vdots$
$w_{1,p-2} + w_{p-2,p-1} + w_{p-1,1} \le 2$
$w_{1,p-1} + w_{p-1,p} + w_{p,1} \le 2.$

Summing these $p-2$ inequalities and substituting $w_{ij} = 1 - w_{ji}$, where appropriate, gives $w_{1,2} + w_{2,3} + \cdots + w_{p-1,p} + w_{p,1} \le p - 1$. Thus, we conclude that $R(LO) \subseteq R(CP)$. □

Given this lemma and Proposition 3, we conclude that $\bar F(\beta^\star_{LN}) \ge \bar F(\beta^\star_{CP})$. □

PROOF OF PROPOSITION 6
This is a generalization of Proposition 5. Suppose we have branched on variable $w_{jk}$ in the LO formulation or, correspondingly, on variable $z_{jk}$ in the LN formulation. If $w_{jk} = 0$, then $\beta_{jk} = 0$; in this case, it is as if we now need to solve the original model with $(j,k) \notin \vec E$. Thus, Proposition 5 (Part A) implies that both models attain the same continuous relaxation value. On the other hand, if $w_{jk} = 1$ in LO (correspondingly, $z_{jk} = 1$ in LN), we define our mapping as $w_{jk} = 1$ and $w_{kj} = 0$ for $(j,k) \in B$, and $w_{jk} = w_{kj} = \tfrac12$ otherwise, where $B$ is the set of arcs $(j,k)$ for which $z_{jk}$ has been fixed to 1. The rest of the proof follows from Proposition 5 (Part A).

The proofs for the TO and CP formulations are almost identical to that of Proposition 5. Suppose we have branched on variable $z_{jk}$ in any of the models (CP, LO, TO). If $z_{jk} = 0$, then $\beta_{jk} = 0$, and it is as if we now need to solve the original model with $(j,k) \notin \vec E$. If $z_{jk} = 1$, one uses the mapping defined above in place of that of Proposition 5; the proof then follows from Proposition 5, Parts B and C, respectively. □

Appendix II
[Figure 10 appears here: line plots versus the number of nodes (10-40), with panels for the moral and complete super-structures at two parameter settings (1.0 and 0.1), and one line per formulation (LN, TO, CP, LO). Panels: (a) optimality gap (MIPGAP); (b) time (in seconds) for MIQPs; (c) best upper bounds; (d) best lower bounds; (e) time (in seconds) for the continuous root relaxation; (f) number of explored nodes in the B&B tree.]

Figure 10: Optimization-based measures for MIQPs for the ℓ model with number of samples n = 100.
[Figure 11 appears here: line plots versus the number of nodes (10-40), with panels for the moral and complete super-structures at two parameter settings (1.0 and 0.1), and one line per formulation (LN, TO, CP, LO). Panels (a), (b): true positive rate (TPR); panels (c), (d): false positive rate (FPR).]

Figure 11: Graph metrics for the MIQPs for ℓ regularization. Plots a, c (left) show graph metrics with the number of samples n = 1000 and plots b, d (right) show the graph metrics with the number of samples n = 100.
[Figure 12 appears here: line plots versus the number of nodes (10-40), with panels for the moral and complete super-structures at two parameter settings (1.0 and 0.1), and one line per formulation (LN, TO, CP, LO). Panels: (a) optimality gap (MIPGAP); (b) time (in seconds) for MIQPs; (c) best obtained upper bounds; (d) best obtained lower bounds; (e) time (in seconds) for the continuous root relaxation; (f) number of explored nodes in the B&B tree.]

Figure 12: Optimization-based measures for MIQPs for the ℓ model with the number of samples n = 100.
[Figure 13 appears here: line plots versus the number of nodes (10-40), with panels for the moral and complete super-structures at two parameter settings (1.0 and 0.1), and one line per formulation (LN, TO, CP, LO). Panels (a), (b): true positive rate (TPR); panels (c), (d): false positive rate (FPR).]

Figure 13: Graph metrics for MIQPs for ℓ regularization. Plots a, c (left) show graph metrics with the number of samples n = 1000. Plots b, d (right) show the graph metrics with the number of samples n = 100.