Bayesian Network Structure Learning Using Quantum Annealing
Bryan O'Gorman, Alejandro Perdomo-Ortiz, Ryan Babbush, Alan Aspuru-Guzik, Vadim Smelyanskiy
BBayesian Network Structure Learning Using Quantum Annealing
Bryan A. O’Gorman , Alejandro Perdomo-Ortiz , Ryan Babbush ,Al´an Aspuru-Guzik , and Vadim Smelyanskiy Quantum Artificial Intelligence Laboratory, NASA Ames Research Center, Moffett Field, CA Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA
October 3, 2014
Abstract
We introduce a method for the problem of learning the structure of a Bayesian network using thequantum adiabatic algorithm. We do so by introducing an efficient reformulation of a standard posterior-probability scoring function on graphs as a pseudo-Boolean function, which is equivalent to a system of2-body Ising spins, as well as suitable penalty terms for enforcing the constraints necessary for the refor-mulation; our proposed method requires O ( n ) qubits for n Bayesian network variables. Furthermore,we prove lower bounds on the necessary weighting of these penalty terms. The logical structure result-ing from the mapping has the appealing property that it is instance-independent for a given number ofBayesian network variables, as well as being independent of the number of data cases.
Bayesian networks are a widely used probabilisticgraphical model in machine learning [1]. A Bayesiannetwork’s structure encapsulates conditional inde-pendence within a set of random variables, and,equivalently, enables a concise, factored representa-tion of their joint probability distribution. There aretwo broad classes of computational problems asso-ciated with Bayesian networks: inference problems,in which the goal is to calculate a probability dis-tribution or the mode thereof given a Bayesian net-work and the state of some subset of the variables;and learning problems, in which the goal is to findthe Bayesian network most likely to have produceda given set of data. Here, we focus on the latter,specifically the problem of Bayesian network struc-ture learning. Bayesian network structure learninghas been applied in fields as diverse as the short-termprediction of solar-flares [2] and the discovery of generegulatory networks [3, 4]. The problem of learningthe most likely structure to have produced a givendata set, with reasonable formal assumptions to beenumerated later, is known to be NP -complete [5],so its solution in practice requires the use of heuris-tics.Quantum annealing is one such heuristic. Thoughefficient quantum algorithms for certain problemsare exponentially faster than their classical counter- part, it is believed that quantum computers cannotefficiently solve NP -complete problems [6]. How-ever, there exist quantum algorithms (for problemsnot known or believed to be NP -complete) thathave a provable quadratic speedup over classical ones[7, 8]. There is therefore reason to believe quantum-mechanical effects such as tunneling could providea polynomial speedup over classical computation forsome sets of problems. The recent availability ofquantum annealing devices from D-Wave Systemshas sparked interest in the experimental determi-nation of whether or not the current generation ofthe device provides such speedup [9, 10]. Whilethere exists prior work related to “quantum Bayesiannetworks”[11] and the “quantum computerization” ofclassical Bayesian network methods [12], the resultspresented here are unrelated.In this paper, we describe how to efficiently map acertain formulation of Bayesian Network Struc-ture Learning ( BNSL ) to
Quadratic Uncon-strained Binary Optimization ( QUBO ). The
QUBO formalism is useful because it is mathemati-cally equivalent to that of a set Ising spins with ar-bitrary 2-body interactions, which can be mappedto the Ising spins with a limited 2-body interactiongraph as implementable by physical quantum anneal-ing devices. Similar mappings have been developedand implemented for lattice protein folding [13, 14],planning and scheduling [15], fault diagnosis [16],1 a r X i v : . [ qu a n t - ph ] O c t raph isomorphism [17], training a binary classifier[18, 19], and the computation of Ramsey numbers[20].To map BNSL to QUBO , we first encode alldigraphs using a set of Boolean variables, each ofwhich indicates the presence or absence of an arc(i.e. directed edge), and define a pseudo-Booleanfunction on those variables that yields the score ofthe digraph encoded therein so long as it satisfies thenecessary constraints. This function is not necessar-ily quadratic, and so we apply standard methods toquadratize (i.e. reduce the degree to two) using an-cillary variables. We then introduce ancillary vari-ables and add additional terms to the pseudo-Booleanfunction corresponding to constraints, each of whichis zero when the corresponding constraint is satisfiedand positive when it is not. The resulting
QUBO in-stance is defined over O ( n ) Boolean variables whenmapped from a BNSL instance with n Bayesian net-work variables. Interestingly, the structure of the
QUBO is instance-independent for a fixed
BNSL size. Because embedding the structure of
QUBO into physical hardware is generally computationallydifficult, this is an especially appealing feature of themapping.We also show sufficient lower bounds on penaltyweights used to scale the terms in the Hamiltonianthat penalize invalid states, like those containing adirected cycle or with parent sets larger than allowed.In a physical device, setting the penalty weights toohigh is counterproductive because there is a fixedmaximum energy scale. The stronger the penaltyweights, the more the logical energy spectrum is com-pressed, which is problematic for two reasons: first,the minimum gap, with which the running time ofthe algorithm scales inversely, is proportionally com-pressed, and, second, the inherently limited precisionof a physical device’s implementation of the interac-tion strengths prevents sufficient resolution of logicalstates close in energy as the spectrum is compressed.The utility of the mapping from
BNSL to QUBO introduced here is not limited to quantum anneal-ing. Indeed, the methods used here were moti-vated by a previous mapping of the same problemto weighted
MAX-SAT [21]. Existing simulated an-nealing code is highly optimized [22] and may be ap-plied to
QUBO instances derived from our mapping.In that case, there is no need to quadratize, becausesimulated annealing does not have the limitation to2-body interactions that physical devices do. Withrespect to penalty weights, while simulated anneal-ing does not have the same gap and precision issuespresent in quantum annealing, there may still be rea-son to avoid setting the penalty weights too high. Be- cause the bits corresponding to arcs with different i.e.terminal vertices do not interact directly, many validstates are separated by invalid ones, and so penaltyweights that are too strong may erect barriers thattend to produce basins of local optima. While simu-lated annealing directly on digraph structures is pos-sible, mapping to
QUBO and performing simulatedannealing in that form has the advantage that it en-ables the exploitation of existing, highly optimizedcode, as well as providing an alternative topology ofthe solution space and energy landscape.
BNSL has a special property that makes it es-pecially well-suited for the application of heuristicssuch as QA: Unlike in other problems where anythingbut the global minimum is undesirable or those inwhich an approximate solution is sufficient, in
BNSL there is utility in having a set of high scoring DAGs.The scoring function encodes the posterior probabil-ity, and so sub- but near-optimal solution may be al-most as probable as the global optimum. In practice,because quantum annealing is an inherently stochas-tic procedure, it is run many times for the same in-stance, producing a set of low-energy states. In caseswhere the BN structure is learned for the purpose ofdoing inference on it, a high-scoring subset of manyquantum annealing runs can utilized by performingBayesian model averaging, in which inference is doneon the set of likely BNs and the results averaged pro-portionally.In Section 2, we review the formalism of Bayesiannetworks and
BNSL (2.1) and quantum annealing(2.2), elucidating the features that make the lattersuitable for finding solutions of the former. In Section3, we develop an efficient and instance-independentmapping from
BNSL to QUBO . In Section 4, we pro-vide sufficient lower bounds on the penalty weightsin the aforementioned mapping. In Section 5, we dis-cuss useful features of the mapping and conclude. Inthe Appendix, we prove the sufficiency of the lowerbounds given; the methods used to do so may be use-ful in mappings for other problems.
A Bayesian network (BN) is a probabilistic graphicalmodel for a set of random variables that encodes theirjoint probability distribution in a more compact wayand with fewer parameters than would be requiredotherwise by taking into account conditional inde-pendences among the variables. It consists of both2 directed acyclic graph (DAG) whose vertices corre-spond to the random variables and an associated setof conditional probabilities for each vertex. Here andthroughout the literature, the same notation is usedfor both a random variable and its corresponding ver-tex, and the referent will be clear from context.Formally, a BN B for n random variables X = ( X i ) ni =1 is a pair ( B S , B P ), where B S is aDAG representing the structure of the networkand B P is the set of conditional probabilities { p ( X i | Π i ( B S )) | ≤ i ≤ n } that give the probabilitydistribution for the state of a variable X i conditionedon the joint state of its parent set Π i ( B S ) (those vari-ables for which there are arcs in the structure B S from the corresponding vertices to that correspond-ing to X i ; we will write simply Π i where the struc-ture is clear from context). Let r i denote the numberof states of the variable X i and q i = (cid:81) j ∈ Π i r j de-note the number of joint states of the parent set Π i of X i (in B S ). Lowercase variables indicate realiza-tions of the corresponding random variable; x ik indi-cates the k -th state of variable X i and π ij indicatesthe j -th joint state of the parent set Π i . The set ofconditional probabilities B P consists of n probabilitydistributions (cid:0) ( θ ij ) q i j =1 (cid:1) ni =1 , where θ ij = ( θ ijk ) r i k =1 isthe conditional probability distribution for the states( x ik ) r i k =1 of the variable X i given the joint state π ij of its parents Π i (i.e. p ( x ik | π ij ) = θ ijk ).Given a database D = { x i | ≤ i ≤ N } consist-ing of N cases, where each x i denotes the state ofall variables X , the goal is to find the structure thatmaximizes the posterior distribution p ( B S | D ) out ofall possible structures. By Bayes’s Theorem, p ( B S | D ) = p ( D | B S ) p ( B S ) p ( D ) . (1)The marginal probability of the database p ( D ) is thesame for all structures, so assuming that each struc-ture is equally likely, this simplifies to p ( B S | D ) ∝ p ( D | B S ) . (2)In Section 3.5, we describe how to account for cer-tain types of non-uniform prior distributions over thegraph structures. With certain further reasonable as-sumptions, namely multinomial sampling, parameterindependence and modularity, and Dirichlet priors,the latter conditional probability is p ( D | B S ) = n (cid:89) i =1 q i (cid:89) j =1 Γ( α ij )Γ( N ij + α ij ) r i (cid:89) k =1 Γ( N ijk + α ijk )Γ( α ijk ) , (3)where N ijk is the number of cases in D such that vari-able X i is in its k -th state and its parent set Π i is in its j -th state, N ij = (cid:80) r i k =1 N ijk , α ijk is the hyperparam-eter for θ ijk in the Dirichlet distribution from which θ ij is assumed to be drawn, and α ij = (cid:80) r i k =1 α ijk [23].Given a database D , our goal is equivalent to thatof finding the structure with the largest likelihood,i.e. the structure that yields the largest probabilityof the given database conditioned on that structure.We do this by encoding all structures into a set ofbits and defining a quadratic pseudo-Boolean func-tion on those bits and additional ancillary bits whoseminimizing bitstring encodes the structure with thelargest posterior probability. Quantum annealing is a method for finding the min-imum value of a given objective function. It is thequantum analogue of classical simulated annealing,where the computation is driven by quantum, ratherthan thermal, fluctuations [24]. A similar procedureis called adiabatic quantum computation, in whichthe adiabatic interpolation of a Hamiltonian whoseground state is easily prepared to one whose groundstate encodes the solution to the desired optimiza-tion problem guarantees that final state is indeedthe ground state of the latter [25]. The formalismfor both is similar, and the methods described hereare useful for both. Specifically, the time-dependentHamiltonian is H ( t ) = A ( t ) H + B ( t ) H , (4)for 0 ≤ t ≤ T , where H is the initial Hamiltonian, H is the final Hamiltonian, A ( t ) is a real monotonicfunction such that A (0) = 1 and A ( T ) = 0, and B ( t )is a real monotonic function such that B (0) = 0 and B ( T ) = 1. The adiabatic theorem states that if thesystem starts in the ground state of H and H ( t )varies slowly enough, then the system will be in theground state of H at time T . Using this procedureto solve an optimization problem entails the construc-tion of H such that its ground state encodes the opti-mal solution. In practice, arbitrary Hamiltonians aredifficult to implement, but this is ameliorated by re-sults showing the ability to effectively implement ar-bitrary Hamiltonians using physically-realizable con-nectivity through various gadgetry with reasonableoverhead [26, 27].The main contribution of this paper is a construc-tion of H such that its ground state encodes thesolution for a given instance of BNSL . Specifically,we construct an instance of
QUBO whose solution isthe score-maximizing DAG; there is a simple transfor-mation between a classically defined
QUBO instance3nd a diagonal quantum 2-local Hamiltonian consist-ing of only Pauli Z and ZZ terms [28].When the desired Hamiltonian is diagonal and 2-local an embedding technique called graph-minor em-bedding can be used [29, 30]. A graph G is a minorof another graph H if there exists a mapping fromvertices of G to disjoint, individually connected sub-graphs of H such that for every edge e in G thereis an edge in H whose adjacent vertices are mappedto by the adjacent vertices of the edge e . The de-sired Hamiltonian and hardware are considered asgraphs, called the logical and physical respectively,where qubits correspond to vertices and edges corre-spond to a 2-body interaction, desired or available.Graph-minor embedding consists of two parts: find-ing a mapping of the logical vertices to sets of phys-ical as described, and setting the parameters of thephysical Hamiltonian such that the logical fields aredistributed among the appropriate physical qubitsand there is a strong ferromagnetic coupling betweenphysical qubits mapped to my the same logical qubitso that they act as one. Determining the graph-minormapping, or even if the logical graph is a minor ofthe physical one, is itself NP -hard, and so in practiceheuristics are used [31]. We use n ( n −
1) bits d = ( d ij ) ≤ i
1. If d i ≤ m , let y ∗ i be such that 0 ≤ y ∗ i = m − d i ≤ m ≤ µ −
1. Then H max ( d i , y ∗ i ) = 0. However, when d i > m , because y i ≥
0, we cannot set y i in that way. By taking thederivative with respect to y i , ∂∂y i H ( i )max ( d i , y i ) = 2 δ ( i )max ( y i − m + d i ) > , (18)we see that H ( i )max takes its minimum value over thedomain of y i when y i = 0, and that value is H ( i )max ( d i ,
0) = δ ( i )max ( m − d i ) . (19) Noting that H ( i )max is nonnegative,min y i H ( i )max ( d i , y i ) = (cid:40) , d i ≤ m,δ max ( d i − m ) , d i > m. (20)Thus, if the constraint | d i | ≤ m is satisfied, H ( i )max does nothing, but if | d i | > m , a penalty of at least δ ( i )max is added. Lastly, we must ensure that the structure encodedin { d ij } has no directed cycles. We do so by intro-ducing additional Boolean variables r = ( r ij ) ≤ i 1) + µ . Second, theset of variables { r ij } are almost fully connected. The mapping so far described assumes a uniformprior distribution over all possible DAGs of the ap-propriate size and with the given maximum numberof parents. However, there are situations in which itmay be desirable to fix the presence or absence of anarc in the search space. This could be because of do-main knowledge or because hardware limitations pre-vent the implementation of the mapping for all arcs,in which case resort can be made to iterative searchprocedures such as the bootstrap method [32]. To re-alize the reduction in qubits needed by accounting forsuch a reduced search space, suppose that we wish toonly consider network structures that include the arc( i, j ), where i < j without loss of generality. We thenset d ij = 1, d ji = 0, and r ij = 1. Similarly, if ( i, j ) isto be excluded, we set d ij = d ji = 0 and keep r ij asa free variable. This can be done for any number ofarcs. The Hamiltonian remains unchanged once thesesubstitutions are made, and the lower bounds on thepenalty weight remain sufficient, with the exceptionof the terms used in quadratization in the case m > In the expression above, there are several sets of freeparameters called penalty weights: { δ i max | ≤ i ≤ n } , { δ ij consist | ≤ i, j ≤ n, i (cid:54) = j } , and { δ ijk trans | ≤ i Order Bits H Trans Slack Bits Figure 1: Top: Logical Graph for n = 7 BN variables with a maximum of m = 2 parents. Each vertexcorresponds to a bit in the original QUBO and an edge between two vertices indicates a non-zero quadraticterm containing the corresponding bits. The central cluster is the order bits used to enforce acyclicity; itis highly connected but not complete. Each ”spike” corresponds to a variable X i in the Bayesian network.The outer two vertices are the corresponding slack bits { y il } and the remaining inner vertices are the arcbits { d ji } representing those arcs for which the corresponding Bayesian network variable is the head. Eachspike is a clique, due to H max (independent of which the arc bits for a given BN variable are fully connecteddue to H score ). Each arc bit is connected to a single order bit and each order bit is connected to two arcbits, due to H consist . Bottom: Schematic of the Hamiltonian. The row of disks represents all of the bits inthe original QUBO problem, colored consistently with the logical graph above. They are grouped into threesets: the arc bits representing the presence of the possible arcs, the order bits representing a total orderingby which we enforce acyclicity, and the slack bits used to limit the size of the parent sets. An arrow from agroup of bits to a part of the Hamiltonian indicates that that part of the Hamiltonian is a function of thatset of bits. 7hat the ground state of the total Hamiltonian is thelowest energy state of H score that satisfies the con-straints. Here we provide sufficient lower bounds onthe penalty weight necessary to ensure that this pur-pose is met. No claim is made to their necessity, andtighter lower bounds may exist. It is important tonote that these bounds are mathematical, i.e. theyensure their purpose is met as stated above. In pureadiabatic quantum computation, in which the quan-tum system is in its ground state for the duration ofthe algorithm, this is sufficient (though the computa-tion time necessary for the conditions of the adiabatictheorem to hold may be longer than otherwise if a pe-nalized state has lower energy than the first excitedunpenalized state). In practical quantum annealing,however, a combination of physical effects may causethe optimal value (in the sense of minimizing the en-ergy of the lowest-energy state found, which may ormay not be the global ground state) of the penaltyweights to be less than these bounds. This remainsthe case even for bounds shown to be tight with re-spect to their mathematical properties.The bound presented for each of the three sets ofpenalty weights is based on the notion that only theaddition of an arc (i.e. changing some d ij from 0 to 1)can lead to the violation of two of the constrains weare concerned with: the maximum number of parentsand the consistency of the arc bits and the order bits.Therefore, we can use a basis for how strongly the as-sociated penalty needs to be the greatest difference inthe energy of H score adding each arc can contribute.The penalty for the third constraint, the absence ofdirected 3-cycles among the order bits, will then be afunction of the penalty for the consistency constraint.Formally, we wish to set the penalties such thatfor any d violating at least one of the constraints, wehave min y , r H ( d , y , r ) > H score ( d ∗ ) , (28)where d ∗ ≡ arg min | d (cid:48) |≤ mG ( d (cid:48) ) is a DAG H score ( d (cid:48) ) . (29)This is achieved by showing that for any such d vio-lating at least one constraint, there is another d (cid:48) thatsatisfies all the constraints such thatmin y , r H ( d , y , r ) ≥ min y , r H ( d (cid:48) , y , r ) . (30)Because d (cid:48) satisfies all the constraints,min y , r H ( d (cid:48) , y , r ) = H score ( d (cid:48) ) , (31)which implies the inequality in (28). In this section,we state the bounds and provide brief justification,but relegate the proofs to Appendix B. In this section, we briefly define an auxiliary quantity,∆ (cid:48) ji = − min { d ki | k (cid:54) = i,j } (cid:26) H ( i )score (cid:12)(cid:12)(cid:12) d ji =1 − H ( i )score (cid:12)(cid:12)(cid:12) d ji =0 (cid:27) , (32)that will allow us to define the maximum penaltyweights associated with the bounds described previ-ously. For details of the calculation of this quantity,see Appendix A. In general it is possible that ∆ ji < ji ≡ max { , ∆ (cid:48) ji } . (33)In the proof of the bounds, the two following factswill be useful. Claim 1 (Monotonicity of H max ) . If d ≥ d (cid:48) , then min y H max ( d , y ) ≥ min y H max ( d (cid:48) , y ) . Claim 2 (Monotonicity of H cycle ) . If d ≥ d (cid:48) , then min r H cycle ( d , r ) ≥ min r H cycle ( d (cid:48) , r ) . These say simply that the removal of one or morearcs from G ( d ) cannot increase the values of H max nor H cycle . Here we show a lower bound for { δ ( i )max } that guar-antees that if d is such that max ≤ i ≤ n | d i | > m there exists a d (cid:48) with lesser total energy such thatmax ≤ i ≤ n | d i | ≤ m . To do so, we show that if, forsome d and i , | d i | > m , there is a d (cid:48) such that | d (cid:48) i | = | d i | − y , r H ( d , y , r ) ≥ min y , r H ( d (cid:48) , y , r ) . (34)This idea can be applied iteratively to show thatif, for some d and i , | d i | > m , there is some d (cid:48) with lesser energy such that d (cid:48) j = d j for j (cid:54) = i and | d (cid:48) i | ≤ m . This idea in turn can be applied iterativelyto show that if for some d max ≤ i ≤ n | d i | > m thereis a d (cid:48) such that max ≤ i ≤ n | d (cid:48) i | ≤ m . Claim 3. If δ ( i ) max > max j (cid:54) = i ∆ ji for all ≤ i ≤ n ,then for all d such that, for some i ∗ , | d i ∗ | > m , thereis a d (cid:48) such that | d (cid:48) i ∗ | = | d i ∗ | − , d (cid:48) i = d i for all i (cid:54) = i ∗ , and min y , r H ( d , y , r ) > min y , r H ( d (cid:48) , y , r ) . Claim 4 (Sufficiency of “Maximum” PenaltyWeight) . If δ ( i ) max > max j (cid:54) = i ∆ ji for all i , then forall d such that max i | d i | > m , there is a d (cid:48) ≤ d such that max i | d (cid:48) i | ≤ m and min y , r H ( d , y , r ) > min y , r H ( d (cid:48) , y , r ) . .3 “Reduction” Penalty Weights The degree of the “score” Hamiltonian H score is na-tively m -local as constructed. If m = 2, as it of-ten will be in practice, the total Hamiltonian is na-tively quadratic. If m > 2, additional ancilla bits areneeded to reduce the locality. The general method fordoing this is to replace the conjunction of a pair bitswith an ancilla bit and to add a penalty term withsufficiently strong weighting that penalizes states inwhich the ancillary bit is not equal to the conjunc-tion to which it should be. For m = 3, this can bedone using n (cid:4) ( n − / (cid:5) ancilla bits, but no more,where each H ( i )score containing n − m = 4, at most n (cid:0) n − (cid:1) ancilla bits are needed. More generally, O ( n d ) isancilla bits are needed [34]. Because the proof of thebounds on the other penalty weights are secular asto the degree of H score so long as { ∆ ij } is computedappropriately, the quadratization of H score , includingthe addition of penalty terms and the needed weightstherefor, can be done using the standard methodsdescribed in the literature independent of the otherpenalties described here. First, we that if the consistency penalty is set highenough, for any d encoding a graph with a 2-cycle,there is some d (cid:48) encoding one whose minimal valueof H over y , r is strictly less than that of d . Claim 5 (Removal of 2-cycles.) . If δ ( ij ) consist > max { ∆ ij , ∆ ji } for all ≤ i < j ≤ n , then for all d such that G ( d ) contains a 2-cycle, there is some d (cid:48) ≤ d such that G ( d (cid:48) ) does not contain a 2-cycleand min y , r H ( d , y , r ) > min y , r H ( d (cid:48) , y , r ) . Second, we show that for any d that encodes a di-graph without a 2-cycle, the minimal value of H consist over all r is zero. Claim 6 (Sufficiency of “Consistency” PenaltyWeights) . If δ ( ij )consist > ( n − 2) max k / ∈{ i,j } δ ( ijk )trans for ≤ i < j ≤ n , then for all d such that G ( d ) con-tains no 2-cycle, H consist ( d , r ∗ ) = 0 , where r ∗ =arg min r H cycle ( d , r ) . Third, we show that for any d that encodes a di-graph not containing a 2-cycle but that is not a DAG,there is some d (cid:48) that does encode a DAG and whoseminimal value of H over all y , r is strictly less thanthat of d . Claim 7 (Sufficiency of “Transitivity” PenaltyWeights) . If δ ( ij )consist > ( n − 2) max k / ∈{ i,j } δ ( ijk )trans for ≤ i < j ≤ n and δ ( ijk )trans = δ trans > max ≤ i (cid:48) ,j (cid:48) ≤ ni (cid:48) (cid:54) = j (cid:48) ∆ i (cid:48) j (cid:48) for ≤ i < j < k ≤ n , then for all d such that G ( d ) does not contain a 2-cycle but does contain a directedcycle there is some d (cid:48) such that G ( d (cid:48) ) is a DAG and min y , r H ( d , y , r ) > min y , r H ( d (cid:48) , y , r ) . Lastly, we show that for all d that encode a di-graph that is not a DAG, there is some d (cid:48) that doesencode a DAG and whose minimal value of H overall y , r is strictly less than that of d . Claim 8 (Sufficiency of “Cycle” Penalty Weights) . If δ ( ij )consist > ( n − 2) max k / ∈{ i,j } δ ( ijk )trans for all ≤ i Claim 9 (Overall Sufficiency) . If δ ( i )max > max j (cid:54) = i ∆ ji for all ≤ i ≤ n , δ ( ij )consist > ( n − 2) max k / ∈{ i,j } δ ( ijk )trans for all ≤ i < j ≤ n and δ ( ijk )trans = δ trans > max ≤ i (cid:48) ,j (cid:48) ≤ ni (cid:48) (cid:54) = j (cid:48) ∆ i (cid:48) j (cid:48) for all ≤ i < j < k ≤ n ,then H ( d ∗ , y , r ) = min max i | d i |≤ mG ( d ) is a DAG H score ( d , y , r ) , G ( d ∗ ) is a DAG, and max i | d ∗ i | ≤ m , where d ∗ =arg min d { min y , r H ( d , y , r ) } . The strict inequalities used in the specification ofthe lower bounds ensures that the global ground stateis a score-maximizing DAG with maximum parent setsize m , but replacing them with weak inequalities issufficient to ensure that the ground state energy isthe greatest score over all DAGs with maximum par-ent set size m . However, the latter is of little interestin the present situation because it is the DAG itselfthat is of interest, not its score per se. We have introduced a mapping from the native for-mulation of BNSL to QUBO that enables the solu-tion of the former using novel methods.9he mapping is unique amongst known mappingsof optimization problems to QUBO in that the logi-cal structure is instance-independent for a given prob-lem size. This enables the expenditure of consider-ably more computational resources on the problemof embedding the logical structure into a physical de-vice because such an embedding need only be doneonce and reused for new instances. The problem ad-dressed, BNSL , is special among optimization prob-lems in that approximate solutions thereto often havevalue rivaling that of the exact solution. This prop-erty, along with the general intractability of exactsolution, implies the great value of efficient heuristicssuch as SA or QA implemented using this mapping.At present, only problems of up to seven BN vari-ables can be embedded in existing quantum anneal-ing hardware (i.e. the D-Wave Two chip installedat NASA Ames Research Center), whereas classi-cal methods are able to deal with many of tens ofBN variables. Nevertheless, the quantum state ofthe art is quickly advancing, and it is conceivablethat quantum annealing could be competitively ap-plied to BNSL in the near future. Given the al-ready advanced state of classical simulated annealingcode, it is similarly conceivable that its application tothe QUBO form described here could be competitivewith other classical methods for solving BNSL . This work was supported by the AFRL Informa-tion Directorate under grant F4HBKC4162G001. Allopinions, findings, conclusions, and recommendationsexpressed in this material are those of the authorsand do not necessarily reflect the views of AFRL.The authors would also like to acknowledge supportfrom the NASA Advanced Exploration Systems pro-gram and NASA Ames Research Center. R. B. and A.A.-G. were supported by the National Science Foun-dation under award NSF CHE-1152291. The authorsare grateful to David Tempel, Ole Mengshoel, andEleanor Rieffel for useful discussions. AppendixA Calculation of ∆ ij Recall the definition of the auxillary quantity,∆ (cid:48) ji = − min { d ki | k (cid:54) = i,j } (cid:26) H ( i )score (cid:12)(cid:12)(cid:12) d ji =1 − H ( i )score (cid:12)(cid:12)(cid:12) d ji =0 (cid:27) , (32) Note that H ( i )score can be decomposed as H ( i )score = (cid:88) J ⊂{ ,...,n }\{ i }| J |≤ m (cid:32) w i ( J ) (cid:89) k ∈ J d ki (cid:33) = (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ m (cid:32) w i ( J ) (cid:89) k ∈ J d ki (cid:33) + (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ m − (cid:32) w i ( J ∪ { j } ) d ji (cid:89) k ∈ J d ki (cid:33) , (35)where the first term is independent of d ji and thuscancels in the argument of the minimization on theright-hand side of Equation 32. Thus ∆ ji simplifiesto∆ (cid:48) ji = − min { d ki | k (cid:54) = i,j } (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ m − (cid:32) w i ( J ∪ { j } ) (cid:89) k ∈ J d ki (cid:33) = max { d ki | k (cid:54) = i,j } − (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ m − (cid:32) w i ( J ∪ { j } ) (cid:89) k ∈ J d ki (cid:33) . (36)For m = 1, ∆ ji is trivially − w i ( { j } ), the constantvalue of the expression to be extremized in Equa-tion 36 regardless of the values of { d ki | k (cid:54) = i, j } . For m = 2, ∆ ji can still be calculated exactly:∆ (cid:48) ji = max { d ki | k (cid:54) = i,j } − (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ (cid:32) w i ( J ∪ { j } ) (cid:89) k ∈ J d ki (cid:33) = max { d ki | k (cid:54) = i,j } − w i ( { j } ) − (cid:88) ≤ k ≤ nk (cid:54) = i,j d ki = − w i ( { j } ) − (cid:88) ≤ k ≤ nk (cid:54) = i,jw i { j,k } ) < w i ( { j, k } )= − w i ( { j } ) − (cid:88) ≤ k ≤ nk (cid:54) = i,j min { , w i ( { j, k } ) } . (37)However, for m ≥ 3, calculation of the extremum inEquation 36 is an intractable optimization problem in10ts own right and therefore we must resort to a rea-sonable bound. Because ∆ ji will be used in findinga lower bound on the necessary penalty weights, cau-tion ditates that we use, if needed, a greater valuethan necessary. A reasonable upper bound on thetrue value is:∆ (cid:48) ji = max { d ki | k (cid:54) = i,j } − (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ m − (cid:32) w i ( J ∪ { j } ) (cid:89) k ∈ J d ki (cid:33) ≤ − (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ m − w i ( J ∪{ j } ) < w i ( J ∪ { j } )= − (cid:88) J ⊂{ ,...,n }\{ i,j }| J |≤ m − min { , w i ( J ∪ { j } ) } . (38) B Proofs of Penalty WeightLower Bounds Claim 1 (Monotonicity of H max ) . If d ≥ d (cid:48) , then min y H max ( d , y ) ≥ min y H max ( d (cid:48) , y ) .Proof. d ≥ d (cid:48) implies d i ≥ d (cid:48) i and thus | d i | ≥ | d (cid:48) i | for 1 ≤ i ≤ n . By design, min y i H max ( d i , y i ) = δ ( i )max max { , | d i | − m } . Let y ∗ ≡ arg min y H ( d , y ).Thenmin y H max ( d , y ) = n (cid:88) i =1 min y i H max ( d i , y i )= n (cid:88) i =1 δ ( i )max max { , | d i | − m }≥ n (cid:88) i =1 δ ( i )max max { , | d (cid:48) i | − m } = n (cid:88) i =1 min y i H max ( d (cid:48) i , y i )= min y H max ( d (cid:48) , y ) . (39) Claim 2 (Monotonicity of H cycle ) . If d ≥ d (cid:48) , then min r H cycle ( d , r ) ≥ min r H cycle ( d (cid:48) , r ) .Proof. In the statement of the claim, we implicitlyassume that δ ( ij )consist > ≤ i < j ≤ n . Let r ∗ = arg min r H cycle ( d , r ). Because d ji ≥ d (cid:48) ji for all i, j such that i (cid:54) = j and 1 ≤ i, j ≤ n , and because 0 ≤ r ij ≤ ≤ i < j ≤ n ,min r H cycle ( d , r )= min r { H consist ( d , r ) + H trans ( r ) } = H consist ( d , r ∗ ) + H trans ( r ∗ )= (cid:88) ≤ i 1. By construction, G ( d ( l +1) )contains one fewer 2-cycle than G ( d ( l ) ). Further-more,min r H cycle ( d ( l ) , r ) − min r H cycle ( d ( l +1) , r )= H cycle ( d ( l ) , r ∗ ) − min r H cycle ( d ( l +1) , r ) ≥ H cycle ( d ( l ) , r ∗ ) − H cycle ( d ( l +1) , r ∗ )= (cid:34)(cid:32) (cid:88) ≤ i 2) max k / ∈{ i,j } δ ( ijk )trans for ≤ i < j ≤ n , then for all d such that G ( d ) con-tains no 2-cycle, H consist ( d , r ∗ ) = 0 , where r ∗ =arg min r H cycle ( d , r ) . Proof. We prove the claim via its contrapositive:for all d , r , if H consist ( d , r ) > 0, there is some r (cid:48) such that H cycle ( d , r ) > H cycle ( d , r (cid:48) ), so r (cid:54) =arg min r H cycle ( d , r ).Consider an arbitrary d and some r such that H consist ( d , r ) > 0. The positivity of H consist ( d , r )indicates that there is at least one inconsistency be-tween d and r , i.e. there is some ( i ∗ , j ∗ ) such that d i ∗ j ∗ = (cid:40) r j ∗ i ∗ , i ∗ > j ∗ − r i ∗ j ∗ , i ∗ < j ∗ = 1. For convenience,we prove the claim for the case in which i ∗ < j ∗ ;the proof provided can be easily modified for thecase in which i ∗ > j ∗ . Let r (cid:48) be the same as r exept in the bit corresponding to this inconsistency: r (cid:48) ij ≡ (cid:40) r ij ( i, j ) (cid:54) = ( i ∗ , j ∗ )1 − r ij , ( i, j ) = ( i ∗ , j ∗ ) . Then H consist ( d , r ) − H consist ( d , r (cid:48) )= (cid:88) ≤ i 2) max k / ∈{ i ∗ ,j ∗ } δ ( i ∗ j ∗ k )trans . (54)Together, the above imply H cycle ( d , r ) − H cycle ( d , r (cid:48) )= H consist ( d , r ) − H consist ( d , r (cid:48) ) + H trans ( r ) − H trans ( r (cid:48) ) ≥ δ ( i ∗ j ∗ )consist − ( n − 2) max k / ∈{ i,j } δ ( ijk )trans > . (55) Claim 7 (Sufficiency of “Transitivity” PenaltyWeights) . If δ ( ij )consist > ( n − 2) max k / ∈{ i,j } δ ( ijk )trans for ≤ i < j ≤ n and δ ( ijk )trans = δ trans > max ≤ i (cid:48) ,j (cid:48) ≤ ni (cid:48) (cid:54) = j (cid:48) ∆ i (cid:48) j (cid:48) for ≤ i < j < k ≤ n , then for all d such that G ( d ) does not contain a 2-cycle but does contain a directedcycle there is some d (cid:48) such that G ( d (cid:48) ) is a DAG and min y , r H ( d , y , r ) > min y , r H ( d (cid:48) , y , r ) .Proof. Consider an arbitrary d ( l ) such that G ( d ( l ) )does not contain a 2-cycle but does contain a directedcycle. Let r ( l ) ≡ arg min r H cycle ( d ( l ) , r ). By Claim6, H consist ( d ( l ) , r ( l ) ) = 0 and so H cycle ( d ( l ) , r ( l ) ) = H trans ( d ( l ) , r ( l ) ).If δ ( ijk )trans = δ trans for 1 ≤ i < j < k ≤ n , i.e.the trasitivity penalty weight is uniform for all di-rected triangles, then H trans ( r ) is equal to the prod-uct of δ trans and the number of directed triangles inthe tournament G ( r ) for all r .In any tournament with a positive number ofdirected triangles, there is always some arc whoseswitch of direction lowers the number of directed tri-angles. Let ( i ∗ , j ∗ ) be such such an arc for r ( l ) . Define˜ r ( l ) such that ˜ r ( l ) ij = (cid:40) r ( l ) ij , ( i, j ) (cid:54) = ( i ∗ , j ∗ )1 − r ( l ) ij , ( i, j ) = ( i ∗ , j ∗ ) . Byconstruction, H trans ( r ∗ l ) − H trans (˜ r ( l ) ) ≥ δ trans . It must be the case that d ( l ) i ∗ j ∗ = 1. Sup-pose otherwise. Define some r (cid:48) such that r (cid:48) ij = (cid:40) r ij , ( i, j ) (cid:54) = ( i ∗ , j ∗ ) , − r ij , ( i, j ) = ( i ∗ , j ∗ ) , , which wouldhave the properties that H consist ( d ( l ) , r (cid:48) ) = 0and, by construction, H trans ( r (cid:48) ) < H trans ( r ( l ) ,so that H cycle ( d ( l ) , r (cid:48) ) < H cycle ( d ( l ) , r ( l ) ) (cid:54) =min r H cycle ( d ( l ) , r ). Because G ( d ( l ) ) does not con-tain a 2-cycle, d ( l ) j ∗ i ∗ = 0.Now, define d ( l +1) such that d ( l +1) ij = (cid:40) d ( l ) ij , ( i, j ) (cid:54) = ( i ∗ , j ∗ ) , − d ( l ) ij , ( i, j ) = ( i ∗ , j ∗ ) . Then H consist ( d ( l +1) , ˜ r ( l ) )= (cid:88) ≤ i 2) max k / ∈{ i,j } δ ( ijk )trans for all ≤ i 2) max k / ∈{ i,j } δ ( ijk )trans for all ≤ i < j ≤ n and δ ( ijk )trans = δ trans > max ≤ i (cid:48) ,j (cid:48) ≤ ni (cid:48) (cid:54) = j (cid:48) ∆ i (cid:48) j (cid:48) for all ≤ i < j < k ≤ n ,then H ( d ∗ , y , r ) = min max i | d i |≤ mG ( d ) is a DAG H score ( d , y , r ) , G ( d ∗ ) is a DAG, and max i | d ∗ i | ≤ m , where d ∗ =arg min d { min y , r H ( d , y , r ) } . Proof. Consider an arbitrary d . If max i | d i | > m ,then by Claim 4, there exists some d (cid:48) such thatmin y , r H ( d , y , r ) > min y , r H ( d (cid:48) , y , r ) (62)and min y H max ( d (cid:48) ) = 0, i.e. max i | d (cid:48) i | ≤ m . If G ( d )has a directed cycle, then by Claim 8, there is some d (cid:48)(cid:48) such thatmin y , r H ( d , y , r ) > min y , r H ( d (cid:48)(cid:48) , y , r ) (63)and min r H cycle ( d (cid:48)(cid:48) , r ) = 0, i.e. G ( d (cid:48)(cid:48) ) is a DAG.Either of these cases implies that d (cid:54) = d ∗ , so itmust be that G ( d ∗ ) is a DAG, max i | d ∗ i | ≤ m , and H ( d ∗ , y , r ) = min max i | d i |≤ mG ( d ) is a DAG H score ( d , y , r ). References [1] Daphne Koller and Nir Friedman. Probabilis-tic Graphical Models: Principles and Techniques- Adaptive Computation and Machine Learning .The MIT Press, 2009.[2] Daren Yu, Xin Huang, Huaning Wang, Yan-mei Cui, Qinghua Hu, and Rui Zhou. Short-term solar flare level prediction using a bayesiannetwork approach. The Astrophysical Journal ,710(1):869, 2010.[3] Amira Djebbari and John Quackenbush. SeededBayesian Networks: Constructing genetic net-works from microarray data. BMC Systems Bi-ology , 2(1):57+, 2008.[4] Nir Friedman, Michal Linial, Iftach Nachman,and Dana Pe’er. Using Bayesian networksto analyze expression data. In Proceedingsof the Fourth Annual International Conferenceon Computational Molecular Biology , RECOMB’00, pages 127–135, New York, NY, USA, 2000.ACM.[5] David Maxwell Chickering. Learning Bayesiannetworks is np-complete. pages 121–130, 1996.[6] Scott Aaronson. BQP and the polynomial hier-archy. In Proceedings of the Forty-second ACMSymposium on Theory of Computing , STOC’10, pages 141–150, New York, NY, USA, 2010.ACM.[7] Lov K. Grover. A fast quantum mechanical al-gorithm for database search. In Proceedings ofthe Twenty-eighth Annual ACM Symposium onTheory of Computing , STOC ’96, pages 212–219,New York, NY, USA, 1996. ACM.158] Rolando D. Somma, Daniel Nagaj, and M´ariaKieferov´a. Quantum speedup by quantum an-nealing. Phys. Rev. Lett. , 109:050501, Jul 2012.[9] Troels F. Rønnow, Zhihui Wang, Joshua Job,Sergio Boixo, Sergei V. Isakov, David Wecker,John M. Martinis, Daniel A. Lidar, and MatthiasTroyer. Defining and detecting quantumspeedup. Science , 2014.[10] David Venturelli, Salvatore Mandr`a, SergeyKnysh, Bryan O’Gorman, Rupak Biswas, andVadim Smelyanskiy. Quantum optimization offully-connected spin glasses. 2014.[11] Robert R. Tucci. An introduction to quantumBayesian networks for mixed states. 2012.[12] Robert R. Tucci. Quantum circuit for discover-ing from data the structure of classical Bayesiannetworks. 2014.[13] Ryan Babbush, Alejandro Perdomo-Ortiz,Bryan O’Gorman, William Macready, andAl´an Aspuru-Guzik. Construction of EnergyFunctions for Lattice Heteropolymer Models:Efficient Encodings for Constraint SatisfactionProgramming and Quantum Annealing , pages201–244. John Wiley & Sons, Inc., 2014.[14] Alejandro Perdomo-Ortiz, Neil Dickson, Mar-shall Drew-Brook, Geordie Rose, and Al´anAspuru-Guzik. Finding low-energy conforma-tions of lattice protein models by quantum an-nealing. Scientific Reports , 2, 2012.[15] Eleanor G. Rieffel, Davide Venturelli, BryanO’Gorman, Minh Do, Elicia Prystay, and VadimSmelyanskiy. A case study in programming aquantum annealer for hard operational planningproblems. To be submitted , 2014.[16] Alejandro Perdomo-Ortiz, Joseph Fluegemann,Sriram Narasimhan, Vadim Smelyanskiy, andRupak Biswas. A quantum annealing approachfor fault detection and diagnosis of graph-basedsystems. To be submitted , 2014.[17] Frank Gaitan and Lane Clark. Graph isomor-phism and adiabatic quantum computing. Phys.Rev. A , 89:022342, Feb 2014.[18] Ryan Babbush, Vasil Denchev, Nan Ding, SergeiIsakov, and Hartmut Neven. Construction ofnon-convex polynomial loss functions for train-ing a binary classifier with quantum annealing.2014. [19] Vasil S. Denchev. Binary Classification with Adi-abatic Quantum Optimization . PhD thesis, Pur-due University, 2013.[20] Zhengbing Bian, Fabian Chudak, William G.Macready, Lane Clark, and Frank Gaitan. Ex-perimental determination of Ramsey numbers. Phys. Rev. Lett. , 111:130505, Sep 2013.[21] James Cussens. Bayesian network learning bycompiling to weighted MAX-SAT. In UAI , pages105–112, 2008.[22] S. V. Isakov, I. N. Zintchenko, T. F. Rønnow,and M. Troyer. Optimized simulated annealingcode for Ising spin glasses. 2014.[23] David Heckerman, Dan Geiger, andDavid Maxwell Chickering. Learning Bayesiannetworks: The combination of knowledge andstatistical data. Machine Learning , 20(3):197–243, 1995.[24] Tadashi Kadowaki and Hidetoshi Nishimori.Quantum annealing in the transverse Isingmodel. Phys. Rev. E , 58:5355–5363, Nov 1998.[25] Edward Farhi, Jeffrey Goldstone, Sam Gut-mann, and Michael Sipser. Quantum computa-tion by adiabatic evolution. 2000.[26] Roberto Oliveira and Barbara M. Terhal. Thecomplexity of quantum spin systems on a two-dimensional square lattice. Quantum Info. Com-put. , 8(10):900–924, November 2008.[27] W.M. Kaminsky and S. Lloyd. Scalable archi-tecture for adiabatic quantum computing of np-hard problems. In A.J. Leggett, B. Ruggiero,and P. Silvestrini, editors, Quantum Computingand Quantum Bits in Mesoscopic Systems , pages229–236. Springer US, 2004.[28] Alejandro Perdomo, Colin Truncik, Ivan Tubert-Brohman, Geordie Rose, and Al´an Aspuru-Guzik. Construction of model Hamiltonians foradiabatic quantum computation and its applica-tion to finding low-energy conformations of lat-tice protein models. Phys. Rev. A , 78:012320,Jul 2008.[29] Vicky Choi. Minor-embedding in adiabaticquantum computation: I. the parameter set-ting problem. Quantum Information Processing ,7(5):193–209, 2008.[30] Vicky Choi. Minor-embedding in adiabaticquantum computation: Ii. minor-universal graph16esign. Quantum Information Processing ,10(3):343–353, 2011.[31] Jun Cai, William G. Macready, and Aidan Roy.A practical heuristic for finding graph minors.2014.[32] Nir Friedman, Moises Goldszmidt, and AbrahamWyner. Data analysis with Bayesian networks:A bootstrap approach. In Proceedings of the Fif-teenth Conference on Uncertainty in ArtificialIntelligence , UAI’99, pages 196–205, San Fran- cisco, CA, USA, 1999. Morgan Kaufmann Pub-lishers Inc.[33] Ryan Babbush, Bryan O’Gorman, and AlnAspuru-Guzik. Resource efficient gadgets forcompiling adiabatic quantum optimization prob-lems. Annalen der Physik , 525(10-11):877–888,2013.[34] Endre Boros and Aritanan Gruber. On quadra-tization of pseudo-Boolean functions.