A Novel Method for Inference of Acyclic Chemical Compounds with Bounded Branch-height Based on Artificial Neural Networks and Integer Programming
Naveed Ahmed Azam, Jianshen Zhu, Yanming Sun, Yu Shi, Aleksandar Shurbevski, Liang Zhao, Hiroshi Nagamochi, Tatsuya Akutsu
aa r X i v : . [ c s . D S ] S e p A Novel Method for Inference of Acyclic ChemicalCompounds with Bounded Branch-height Based onArtificial Neural Networks and Integer Programming
Naveed Ahmed Azam , Jianshen Zhu , Yanming Sun , Yu Shi ,Aleksandar Shurbevski , Liang Zhao , Hiroshi Nagamochi , Tatsuya Akutsu , Abstract
Analysis of chemical graphs is becoming a major research topic in computational molec-ular biology due to its potential applications to drug design. One of the major ap-proaches in such a study is inverse quantitative structure activity/property relation-ships (inverse QSAR/QSPR) analysis, which is to infer chemical structures from givenchemical activities/properties. Recently, a novel framework has been proposed forinverse QSAR/QSPR using both artificial neural networks (ANN) and mixed integerlinear programming (MILP). This method consists of a prediction phase and an inverseprediction phase. In the first phase, a feature vector f ( G ) of a chemical graph G isintroduced and a prediction function ψ N on a chemical property π is constructed withan ANN N . In the second phase, given a target value y ∗ of the chemical property π , afeature vector x ∗ is inferred by solving an MILP formulated from the trained ANN N sothat ψ N ( x ∗ ) is close to y ∗ and then a set of chemical structures G ∗ such that f ( G ∗ ) = x ∗ is enumerated by a graph search algorithm. The framework has been applied to thecase of chemical compounds with cycle index up to 2 so far. The computational resultsconducted on instances with n non-hydrogen atoms show that a feature vector x ∗ canbe inferred for up to around n = 40 whereas graphs G ∗ can be enumerated for up toaround n = 15. When applied to the case of chemical acyclic graphs, the maximumcomputable diameter of G ∗ was around up to around 8. In this paper, we introduce anew characterization of graph structure, called “branch-height” based on which a newMILP formulation and a new graph search algorithm are designed for chemical acyclicgraphs. The results of computational experiments using such chemical properties asoctanol/water partition coefficient, boiling point and heat of combustion suggest thatthe proposed method can infer chemical acyclic graphs G ∗ with n = 50 and diameter30. Keywords:
QSAR/QSPR, Molecular Design, Artificial Neural Network, Mixed In-teger Linear Programming, Enumeration of Graphs
Mathematics Subject Classification:
Primary 05C92, 92E10, Secondary 05C30,68T07, 90C11, 92-04
In computational molecular biology, various types of data have been utilized, which include se-quences, gene expression patterns, and protein structures. Graph structured data have also beenextensively utilized, which include metabolic pathways, protein-protein interaction networks, generegulatory networks, and chemical graphs. Much attention has recently been paid to analysis1f chemical graphs due to its potential applications to computer-aided drug design. One of themajor approaches to computer-aided drug design is quantitative structure activity/property re-lationships (QSAR/QSPR) analysis, the purpose of which is to derive quantitative relationshipsbetween chemical structures and their activities/properties. Furthermore, inverse QSAR/QSPRhas been extensively studied [13, 19], the purpose of which is to infer chemical structures fromgiven chemical activities/properties. Inverse QSAR/QSPR is often formulated as an optimiza-tion problem to find a chemical structure maximizing (or minimizing) an objective function undervarious constraints.In both QSAR/QSPR and inverse QSAR/QSPR, chemical compounds are usually representedas vectors of real or integer numbers, which are often called descriptors and correspond to featurevectors in machine learning. Using these chemical descriptors, various heuristic and statisticalmethods have been developed for finding optimal or nearly optimal graph structures under givenobjective functions [8, 13, 17]. Inference or enumeration of graph structures from a given featurevector is a crucial subtask in many of such methods. Various methods have been developed for thisenumeration problem [6, 10, 12, 16] and the computational complexity of the inference problemhas been analyzed [1, 14]. On the other hand, enumeration in itself is a challenging task, sincethe number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C , N , O , and S , mayexceed 10 [4].As a new approach, artificial neural network (ANN) and deep learning technologies have re-cently been applied to inverse QSAR/QSPR. For example, variational autoencoders [7], recurrentneural networks [18, 23], and grammar variational autoencoders [11] have been applied. In theseapproaches, new chemical graphs are generated by solving a kind of inverse problems on neuralnetworks that are trained using known chemical compound/activity pairs. However, the optimal-ity of the solution is not necessarily guaranteed in these approaches. In order to guarantee theoptimality mathematically, a novel approach has been proposed [2] for ANNs, using mixed integerlinear programming (MILP).Recently, a new framework has been proposed [3, 5, 24] by combining two previous approaches;efficient enumeration of tree-like graphs [6], and MILP-based formulation of the inverse problemon ANNs [2]. This combined framework for inverse QSAR/QSPR mainly consists of two phases.The first phase solves (I) Prediction Problem , where a feature vector f ( G ) of a chemical graph G is introduced and a prediction function ψ N on a chemical property π is constructed with anANN N using a data set of chemical compounds G and their values a ( G ) of π . The second phasesolves (II) Inverse Problem , where (II-a) given a target value y ∗ of the chemical property π , afeature vector x ∗ is inferred from the trained ANN N so that ψ N ( x ∗ ) is close to y ∗ and (II-b) thena set of chemical structures G ∗ such that f ( G ∗ ) = x ∗ is enumerated by a graph search algorithm.In (II-a) of the above-mentioned previous methods [3, 5, 24], an MILP is formulated for acyclicchemical compounds. Afterwards, Ito et al. [9] and Zhu et al. [25] designed a method of inferringchemical graphs with cycle index 1 and 2, respectively by formulating a new MILP and usingan efficient algorithm for enumerating chemical graphs with cycle index 1 [20] and cycle index 2[21, 22]. The computational results conducted on instances with n non-hydrogen atoms show thata feature vector x ∗ can be inferred for up to around n = 40 whereas graphs G ∗ can be enumeratedfor up to around n = 15. 2n this paper, we present a new characterization of graph structure, called “branch-height.”Based on this, we can treat a class of acyclic chemical graphs with a structure that is topologicallyrestricted but frequently appears in the chemical database, formulate a new MILP formulation thatcan handle acyclic graphs with a large diameter, and design a new graph search algorithm thatgenerates acyclic chemical graphs with up to 50 vertices. The results of computational experimentsusing such chemical properties as octanol/water partition coefficient, boiling point and heat ofcombustion suggest that the proposed method is much more useful than the previous method.The paper is organized as follows. Section 2 introduces some notions on graphs, a modelingof chemical compounds and a choice of descriptors. Section 3 reviews the framework for inferringchemical compounds based on ANNs and MILPs. Section 4 introduces a new method of modelingacyclic chemical graphs and proposes a new MILP formulation that represents an acyclic chemicalgraph G with n vertices, where our MILP requires only O ( n ) variables and constraints when thebranch-parameter k and the k -branch-height in G (graph topological parameters newly introducedin this paper) is constant. Section 5 describes the idea of our new dynamic programming typeof algorithm that enumerates a given number of acyclic chemical graphs for a given feature vec-tor. Section 6 reports the results on some computational experiments conducted for s chemicalproperties such as octanol/water partition coefficient, boiling point and heat of combustion. Sec-tion 7 makes some concluding remarks. Appendix A provides the statistical feature on structureof acyclic chemical graphs in a chemical graph database. Appendix B describes the details of allvariables and constraints in our MILP formulation. Appendix C presents descriptions of our newgraph search algorithm. This section introduces some notions and terminology on graphs, a modeling of chemical com-pounds and our choice of descriptors.Let R , Z and Z + denote the sets of reals, integers and non-negative integers, respectively. Fortwo integers a and b , let [ a, b ] denote the set of integers i with a ≤ i ≤ b . A graph stands for a simple undirected graph, where an edge joining two vertices u and v isdenoted by uv (= vu ). The sets of vertices and edges of a graph G are denoted by V ( G ) and E ( G ), respectively. Let H = ( V, E ) be a graph with a set V of vertices and a set E of edges. Fora vertex v ∈ V , the set of neighbors of v in H is denoted by N H ( v ), and the degree deg H ( v ) of v is defined to be | N H ( v ) | . The length of a path is defined to be the number of edges in the path.The distance dist H ( u, v ) between two vertices u, v ∈ V is defined to be the minimum length of apath connecting u and v in H . The diameter dia( H ) of H is defined to be the maximum distancebetween two vertices in H ; i.e., dia( H ) , max u,v ∈ V dist H ( u, v ). Denote by ℓ ( P ) the length of apath P . Trees
For a tree T with an even (resp., odd) diameter d , the center is defined to be the vertex v (resp., the adjacent vertex pair { v, v ′ } ) that situates in the middle of one of the longest paths with3ength d . The center of each tree is uniquely determined. Rooted Trees A rooted tree is defined to be a tree where a vertex (or a pair of adjacent vertices)is designated as the root . Let T be a rooted tree, where for two adjacent vertices u and v , vertex u is called the parent of v if u is closer to the root than v is. The height height( v ) of a vertex v in T is defined to be the maximum length of a path from v to a leaf u in the descendants of v , whereheight( v ) = 0 for each leaf v in T . Figure 1(a) and (b) illustrate examples of trees rooted at thecenter. Degree-bounded Trees
For positive integers a, b and c with b ≥
2, let T ( a, b, c ) denote the rootedtree such that the number of children of the root is a , the number of children of each non-rootinternal vertex is b and the distance from the root to each leaf is c . We see that the number ofvertices in T ( a, b, c ) is a ( b c − / ( b −
1) + 1, and the number of non-leaf vertices in T ( a, b, c ) is a ( b c − − / ( b −
1) + 1. In the rooted tree T ( a, b, c ), we denote the vertices by v , v , . . . , v n with abreadth-first-search order, and denote the edge between a vertex v i with i ∈ [2 , n ] and its parentby e i , where n = a ( b c − / ( b −
1) + 1 and each vertex v i with i ∈ [1 , a ( b c − − / ( b −
1) + 1] isa non-leaf vertex. For each vertex v i in T ( a, b, c ), let Cld( i ) denote the set of indices j such that v j is a child of v i , and prt( i ) denote the index j such that v j is the parent of v i when i ∈ [2 , n ].Let P prc ( a, b, c ) be a set of ordered index pairs ( i, j ) of vertices v i and v j in T ( a, b, c ). We call P prc ( a, b, c ) proper if the next conditions hold:(a) For each subtree H = ( V, E ) of T ( a, b, c ) with v ∈ V , there is at least one subtree H ′ =( V ′ , E ′ ) such that- H ′ is isomorphic to H by a graph isomorphism ψ : V → V ′ with ψ ( v ) = v ; and- for each pair ( i, j ) ∈ P prc ( a, b, c ), if v j ∈ V ′ then v i ∈ V ′ ; and(b) For each pair of vertices v i and v j in T ( a, b, c ) such that v i is the parent of v j , there is asequence ( i , i ) , ( i , i ) , . . . , ( i k − , i k ) of index pairs in P prc ( a, b, c ) such that i = i and i k = j .Note that a proper set P prc ( a, b, c ) is not necessarily unique. Branch-height in Trees
In this paper, we introduce “branch-height” of a tree as a new measureto the “agglomeration degree” of trees. We specify a non-negative integer k , called a branch-parameter to define branch-height. First we regard T as a rooted tree by choosing the center of T as the root. Figure 1(a) and (b) illustrate examples of rooted trees. We introduce the followingterminology on a rooted tree T .- A leaf k -branch : a non-root vertex v in T such that height( v ) = k .- A non-leaf k -branch : a vertex v in T such that v has at least two children u with height( u ) ≥ k . We call a leaf or non-leaf k -branch a k -branch . Figure 2(a)-(c) illustrate the k -branchesof the rooted tree H in Figure 1(b) for k = 1 , k -branch-path : a path P in T that joins two vertices u and u ′ such that each of u and u ′ isthe root or a k -branch and P does not contain the root or a k -branch as an internal vertex.- The k -branch-subtree of T : the subtree of T that consists of the edges in all k -branch-pathsof T . We call a vertex (resp., an edge) in T a k -internal vertex (resp., a k -internal edge ) if it4 a) H (b) H centercenter v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v (c) k =2 root v v v v v Figure 1: An illustration of rooted trees and a 2-branch-tree: (a) A tree H with odd diameter 11;(b) A tree H with even diameter 10; (c) The 2-branch-tree of H .is contained in the k -branch-subtree of T and a k -external vertex (resp., a k -external edge )otherwise. Let V in and V ex (resp., E in and E ex ) denote the sets of k -internal and k -externalvertices (resp., edges) in T .- The k -branch-tree of T : the rooted tree obtained from the k -branch-subtree of T by replacingeach k -branch-path with a single edge. Figure 1(c) illustrates the 2-branch-tree of the rootedtree H in Figure 1(b).- A k -fringe-tree : One of the connected components that consists of the edges not in any k -branch-subtree. Each k -fringe-tree T ′ contains exactly one vertex v in a k -branch-subtree,where T ′ is regarded as a tree rooted at v . Note that the height of any k -fringe-tree is atmost k . Figure 2(a)-(c) illustrate the k -fringe-tree of the rooted tree H in Figure 1(b) for k = 1 , k -branch-leaf-number bl k ( T ): the number of leaf k -branches in T . For the trees H i , i = 1 , ( H ) = bl ( H ) = 8, bl ( H ) = bl ( H ) = 5,bl ( H ) = bl ( H ) = 3 and bl ( H ) = bl ( H ) = 2.- The k -branch-height bh k ( T ) of T : the maximum number of non-root k -branches along a pathfrom the root to a leaf of T ; i.e., bh k ( T ) is the height of the k -branch-tree T ∗ (the maximumlength of a path from the root to a leaf in T ∗ ). For the example of trees H i , i = 1 , ( H ) = bh ( H ) = 5, bh ( H ) = bh ( H ) = 3,bh ( H ) = bh ( H ) = 2 and bh ( H ) = bh ( H ) = 1.We observe that most chemical graphs G with at most 50 non-hydrogen atoms satisfy bh ( G ) ≤
2. See Appendix A for a summary of statistical feature of chemical graphs registered in the chemicaldatabase PubChem. 5 oot v v v v v v v v v v v v v v v v v v v v v (a) k =1 v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v (b) k =2 (c) k =3 root root Figure 2: An illustration of the k -branches (depicted by gray circles), the k -branch-subtree (de-picted by solid lines) and k -fringe-trees (depicted by dashed lines) of H : (a) k = 1; (b) k = 2; (c) k = 3. We represent the graph structure of a chemical compound as a graph with labels on vertices andmultiplicity on edges in a hydrogen-suppressed model. Let Λ be a set of labels each of whichrepresents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where weassume that Λ does not contain H (hydrogen). Let mass( a ) and val( a ) denote the mass and valenceof a chemical element a ∈ Λ, respectively. In our model, we use integers mass ∗ ( a ) = ⌊ · mass( a ) ⌋ , a ∈ Λ and assume that each chemical element a ∈ Λ has a unique valence val( a ) ∈ [1 , < over the elements in Λ according to their mass values; i.e., wewrite a < b for chemical elements a , b ∈ Λ with mass( a ) < mass( b ). Choose a set Γ < of tuples γ = ( a , b , m ) ∈ Λ × Λ × [1 ,
3] such that a < b . For a tuple γ = ( a , b , m ) ∈ Λ × Λ × [1 , γ denote the tuple ( b , a , m ). Set Γ > = { γ | γ ∈ Γ < } and Γ = = { ( a , a , m ) | a ∈ Λ , m ∈ [1 , } . A pairof two atoms a and b joined with a bond-multiplicity m is denoted by a tuple γ = ( a , b , m ) ∈ Γ,called the adjacency-configuration of the atom pair.We use a hydrogen-suppressed model because hydrogen atoms can be added at the final stage.A chemical graph over Λ and Γ < ∪ Γ = is defined to be a tuple G = ( H, α, β ) of a graph H = ( V, E ),a function α : V → Λ and a function β : E → [1 ,
3] such that(i) H is connected;(ii) P uv ∈ E β ( uv ) ≤ val( α ( u )) for each vertex u ∈ V ; and(iii) ( α ( u ) , α ( v ) , β ( uv )) ∈ Γ < ∪ Γ = for each edge uv ∈ E .For a notational convenience, we denote the sum of bond-multiplicities of edges incident to a vertexas follows: β ( u ) , X uv ∈ E β ( uv ) for each vertex u ∈ V .A chemical graph G = ( H, α, β ) is called a “chemical monocyclic graph” if the graph H is amonocyclic graph. Similarly for other types of graphs for H .6e define the bond-configuration of an edge e = uv ∈ E in a chemical graph G to be atuple (deg H ( u ) , deg H ( v ) , β ( e )) such that deg H ( u ) ≤ deg H ( v ) for the end-vertices u and v of e .Let Bc denote the set of bond-configurations µ = ( d , d , m ) ∈ [1 , × [1 , × [1 ,
3] such thatmax { d , d } + m ≤
4. We regard that ( d , d , m ) = ( d , d , m ). For two tuples µ = ( d , d , m ) , µ ′ =( d ′ , d ′ , m ′ ) ∈ Bc, we write µ ≥ µ ′ if max { d , d } ≥ max { d ′ , d ′ } , min { d , d } ≥ min { d ′ , d ′ } and m ≥ m ′ , and write µ > µ ′ if µ ≥ µ ′ and µ = µ ′ . In our method, we use only graph-theoretical descriptors for defining a feature vector, whichfacilitates our designing an algorithm for constructing graphs. Given a chemical acyclic graph G = ( H, α, β ), we define a feature vector f ( G ) that consists of the following 11 kinds of descriptors.We choose an integer k ∗ ∈ [1 ,
4] as a branch-parameter.- n ( G ): the number | V | of vertices.- dg in i ( G ), i ∈ [1 , k ∗ -internal vertices of degree i in H ; i.e., dg in i ( G ) , |{ v ∈ V in | deg H ( v ) = i }| , where the multiplicity of edges incident to a vertex v is ignored in thedegree of v .- dg ex i ( G ), i ∈ [1 , k ∗ -external vertices of degree i in H ; i.e., dg ex i ( G ) , |{ v ∈ V ex | deg H ( v ) = i }| .- dia( G ): the diameter of H divided by | V | ; i.e., dia( G ) , dia( H ) /n ( G ).- bl k ∗ ( G ): the k ∗ -branch-leaf-number of G .- bh k ∗ ( G ): the k ∗ -branch-height of G .- ce in a ( G ), a ∈ Λ: the number of k ∗ -internal vertices with label a ∈ Λ; i.e., ce in a ( G ) , |{ v ∈ V in | α ( v ) = a }| .- ce ex a ( G ), a ∈ Λ: the number of k ∗ -external vertices with label a ∈ Λ; i.e., ce ex a ( G ) , |{ v ∈ V ex | α ( v ) = a }| .- ms( G ): the average mass ∗ of atoms in G ; i.e., ms( G ) , P v ∈ V mass ∗ ( α ( v )) /n ( G ).- bd in m ( G ), m = 2 ,
3: the number of double and triple bonds of k ∗ -internal edges; i.e., bd in m ( G ) , { e ∈ E in | β ( e ) = m } , m = 2 , ex m ( G ), m = 2 ,
3: the number of double and triple bonds of k ∗ -internal edges; i.e., bd ex m ( G ) , { e ∈ E ex | β ( e ) = m } , m = 2 , in γ ( G ), γ = ( a , b , m ) ∈ Γ: the number of adjacency-configurations ( a , b , m ) of k ∗ -internaledges in G .- ac ex γ ( G ), γ = ( a , b , m ) ∈ Γ: the number of adjacency-configurations ( a , b , m ) of k ∗ -externaledges in G . 7 bc in µ ( G ), µ = ( d, d ′ , m ) ∈ Bc: the number of bond-configurations ( d, d ′ , m ) of k ∗ -internaledges in G .- bc ex µ ( G ), µ = ( d, d ′ , m ) ∈ Bc: the number of bond-configurations ( d, d ′ , m ) of k ∗ -externaledges in G .- n H ( G ): the number of hydrogen atoms; i.e., n H ( G ) , X a ∈ Λ , t ∈{ in , ex } val( a )ce t a ( G ) − X γ =( a , b ,m ) ∈ Γ , t ∈{ in , ex } m · ac t γ ( G ))= X a ∈ Λ , t ∈{ in , ex } val( a )ce t a ( G ) − n ( G ) − X m ∈ [2 , , t ∈{ in , ex } m · bd t m ( G )).The number K of descriptors in our feature vector x = f ( G ) is K = 2 | Λ | + 2 | Γ | + 50. Note thatthe set of the above K descriptors is not independent in the sense that some descriptor depends onthe combination of other descriptors in the set. For example, descriptor bd in i ( G ) can be determinedby P γ =( a , b ,m ) ∈ Γ: m = i ac in γ ( G ). We review the framework that solves the inverse QSAR/QSPR by using MILPs [9, 25], which isillustrated in Figure 3. For a specified chemical property π such as boiling point, we denote by a ( G ) the observed value of the property π for a chemical compound G . As the first phase, we solve(I) Prediction Problem with the following three steps.
Phase 1.Stage 1:
Let DB be a set of chemical graphs. For a specified chemical property π , choose aclass G of graphs such as acyclic graphs or monocyclic graphs. Prepare a data set D π = { G i | i = 1 , , . . . , m } ⊆ G ∩ DB such that the value a ( G i ) of each chemical graph G i , i = 1 , , . . . , m isavailable. Set reals a, a ∈ R so that a ≤ a ( G i ) ≤ a , i = 1 , , . . . , m . Stage 2:
Introduce a feature function f : G → R K for a positive integer K . We call f ( G ) the feature vector of G ∈ G , and call each entry of a vector f ( G ) a descriptor of G . Stage 3:
Construct a prediction function ψ N with an ANN N that, given a vector in R K , returnsa real in the range [ a, a ] so that ψ N ( f ( G )) takes a value nearly equal to a ( G ) for many chemicalgraphs in D . See Figure 3(a) for an illustration of Stages 1 ,2 and 3 in Phase 1.In this paper, we use the range-based method to define an applicability domain (AD) [15] to ourinverse QSAR/QSPR. Set x j and x j to be the minimum and maximum values of the j -th descriptor x j in f ( G i ) over all graphs G i , i = 1 , , . . . , m (where we possibly normalize some descriptors suchas ce in a ( G ), which is normalized with ce in a ( G ) /n ( G )). Define our AD D to be the set of vectors x ∈ R K such that x j ≤ x j ≤ x j for the variable x j of each j -th descriptor, j = 1 , , . . . , k .In the second phase, we try to find a vector x ∗ ∈ R K from a target value y ∗ of the chemicalpropery π such that ψ N ( x ∗ ) = y ∗ . Based on the method due to Akutsu and Nagamochi [2],8 R K x* MILP g * inputoutput R G*G * M ( x , y , g ; C , C ) M ( x , y ; C ) M ( x , g ; C ) Stage 5 x* AD DA y ( f ( G* ) ) = y* N no G* G s.t. detectdeliver Stage 4 G : class of chemical graphs R K x : = f ( G ) a ( G ) y ( x ) ANN N R G a : property function Stage 2Stage 1 Stage 3 f ( G ) N AD DA N ... f : feature function y : prediction function N G ... f ( G* ) y ( x *)= y * N G (a) Phase 1 (b) Phase 2 f ( G* ) i y* : target value Figure 3: (a) An illustration of Phase 1: Stage 1 for preparing a data set D π for a graph class G and a specified chemical property π ; Stage 2 for introducing a feature function f with descriptors;Stage 3 for constructing a prediction function ψ N with an ANN N ; (b) An illustration of Phase 2:Stage 4 for formulating an MILP M ( x, y, g ; C , C ) and finding a feasible solution ( x ∗ , g ∗ ) of theMILP for a target value y ∗ so that ψ N ( x ∗ ) = y ∗ (possibly detecting that no target graph G ∗ exists);Stage 5 for enumerating graphs G ∗ ∈ G such that f ( G ∗ ) = x ∗ .Chiewvanichakorn et al. [5] showed that this problem can be formulated as an MILP. By includinga set of linear constraints such that x ∈ D into their MILP, we obtain the next result. Theorem 1. ([9, 25])
Let N be an ANN with a piecewise-linear activation function for an inputvector x ∈ R K , n A denote the number of nodes in the architecture and n B denote the total numberof break-points over all activation functions. Then there is an MILP M ( x, y ; C ) that consists ofvariable vectors x ∈ D ( ⊆ R K ) , y ∈ R , and an auxiliary variable vector z ∈ R p for some integer p = O ( n A + n B ) and a set C of O ( n A + n B ) constraints on these variables such that: ψ N ( x ∗ ) = y ∗ if and only if there is a vector ( x ∗ , y ∗ ) feasible to M ( x, y ; C ) . See Appendix B.1 for the set of constraints to define our AD D in the MILP M ( x, y ; C ) inTheorem 1.A vector x ∈ R K is called admissible if there is a graph G ∈ G such that f ( G ) = x [3]. Let A denote the set of admissible vectors x ∈ R K . To ensure that a vector x ∗ inferred from a giventarget value y ∗ becomes admissible, we introduce a new vector variable g ∈ R q for an integer q .For the class G of chemical acyclic graphs, Azam et al. [3] introduced a set C of new constraintswith a new vector variable g ∈ R q for an integer q so that a feasible solution ( x ∗ , g ∗ ) of a newMILP for a target value y ∗ delivers a vector x ∗ with ψ N ( x ∗ ) = y ∗ and a vector g ∗ that representsa chemical acyclic graph G ∗ ∈ G . Afterwards, for the classes of chemical graphs with cycle index1 and 2, Ito et al. [3] and Zhu et al. [25] presented such a set C of constraints so that a vector g ∗ in a feasible solution ( x ∗ , g ∗ ) of a new MILP can represent a chemical graph G ∗ in the class G ,respectively.As the second phase, we solve (II) Inverse Problem for the inverse QSAR/QSPR by treatingthe following inference problems.(II-a) Inference of Vectors 9 nput:
A real y ∗ with a ≤ y ∗ ≤ a . Output:
Vectors x ∗ ∈ A ∩ D and g ∗ ∈ R q such that ψ N ( x ∗ ) = y ∗ and g ∗ forms a chemical graph G ∗ ∈ G with f ( G ∗ ) = x ∗ .(II-b) Inference of Graphs Input:
A vector x ∗ ∈ A ∩ D . Output:
All graphs G ∗ ∈ G such that f ( G ∗ ) = x ∗ .The second phase consists of the next two steps. Phase 2.Stage 4:
Formulate Problem (II-a) as the above MILP M ( x, y, g ; C , C ) based on G and N . Finda feasible solution ( x ∗ , g ∗ ) of the MILP such that x ∗ ∈ A ∩ D and ψ N ( x ∗ ) = y ∗ (where the second requirement may be replaced with inequalities (1 − ε ) y ∗ ≤ ψ N ( x ∗ ) ≤ (1 + ε ) y ∗ for a tolerance ε > Stage 5:
To solve Problem (II-b), enumerate all (or a specified number) of graphs G ∗ ∈ G suchthat f ( G ∗ ) = x ∗ for the inferred vector x ∗ . See Figure 3(b) for an illustration of Stages 4 and 5 inPhase 2. In this paper, we choose a branch-parameter k ≥ G of chemical acyclic graphs G such that- the maximum degree in G is at most 4;- the k -branch height bh k ( G ) is bounded for a specified branch-parameter k ; and- the size of each k -fringe-tree in G is bounded.The reason why we restrict ourselves to the graphs in G is that this class G covers a large partof the acyclic chemical compounds registered in the chemical database PubChem. See Appendix Afor a summary of the statical feature of the chemical graphs in PubChem in terms of k -branchheight and the size of 2-fringe-trees. According to this, over 55% (resp., 99%) of acyclic chemicalcompounds with up to 100 non-hydrogen atoms in PubChem have the maximum degree 3 (resp.,4); and nearly 87% (resp., 99%) of acyclic chemical compounds with up to 50 non-hydrogen atomsin PubChem has the 2-branch height at most 1 (resp., 2). This implies that k = 2 is sufficientto cover the most of chemical acyclic graphs. For k = 2, over 92% of 2-fringe-trees of chemicalcompounds with up to 100 non-hydrogen atoms in PubChem obey the following size constraint: n ≤ d + 2 for each 2-fringe-tree T with n vertices and d children of the root. (1)We formulate an MILP in Stage 4 that, given a target value y ∗ , infers a vector x ∗ ∈ Z K + with ψ N ( x ∗ ) = y ∗ and a chemical acyclic graph G ∗ = ( H, α, β ) ∈ G with f ( G ∗ ) = x ∗ . We here specifysome of the features of a graph G ∗ ∈ G such as the number of non-hydrogen atoms in order tocontrol the graph structure of target graphs to be inferred and to simplify MILP formulations. Inthis paper, we specify the following features on a graph G ∈ G : a set Λ of chemical elements, a10et Γ < of adjacency-configuration, the maximum degree, the number of non-hydrogen atoms, thediameter, the k -branch-height and the k -branch-leaf-number for a branch-parameter k .More formally, given specified integers n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ∈ Z other than Λ and Γ, let H ( n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) denote the set of acyclic graphs H such thatthe maximum degree of a vertex is at most 3 when d max = 3 (or equal to 4 when d max = 4),the number n ( H ) of vertices in H is n ∗ ,the diameter dia( H ) of H is dia ∗ ,the k ∗ -branch-height bh k ∗ ( H ) is bh ∗ ,the k ∗ -branch-leaf-number bl k ∗ ( H ) is bl ∗ and(1) holds.To design Stage 4 for our class G , we formulate an MILP M ( x, g ; C ) that infers a chemi-cal graph G ∗ = ( H, α, β ) ∈ G with H ∈ H ( n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) for a given specification(Λ , Γ , n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) The details will be given in Section 4 and Appendix B.Design of Stage 5; i.e. generating chemical graphs G ∗ that satisfy f ( G ∗ ) = x ∗ for a givenfeature vector x ∗ ∈ Z K + is still challenging for a relatively large instance with size n ( G ∗ ) ≥ G ∗ in Stage 5 for the classesof graphs with cycle index 0 to 2 [6, 20, 21, 22]. All of these are designed based on the branch-and-bound method and can generate a target chemical graph with size n ( G ∗ ) ≤
20. To breakthis barrier, we newly employ the dynamic programming method for designing an algorithm inStage 5 in order to generate a target chemical graph G ∗ with size n ( G ∗ ) = 50. For this, we furtherrestrict the structure of acyclic graphs G so that the number bl ( G ) of leaf 2-branches is at most3. Among all acyclic chemical compounds with up to 50 non-hydrogen atoms in the chemicaldatabase PubChem, the ratio of the number of acyclic chemical compounds G with bl ( G ) ≤ ( G ) ≤
3) is 78% (resp., 95%). See Section 5 for the details on the new algorithm inStage 5.
In this section, we formulate an MILP M ( x, g ; C ) to infer a chemical acyclic graph G in the class G for a given specification (Λ , Γ , n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) defined in the previous section. We introduce a directed graph with size O ( n ∗ · ( d max − max { bh ∗ ,k ∗ } + ( d max − bh ∗ + k ∗ ), called a scheme graph SG, so that an acyclic graph H ∈ H ( n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) can be chosen fromthe scheme graph SG. Let t ∗ , s ∗ and c ∗ be integers such that t ∗ = n ∗ − (bh ∗ − − ( k ∗ + 1)bl ∗ ,s ∗ = a ( b c − / ( b −
1) + 1 for a = d max , b = d max − c = bh ∗ , c ∗ = s ∗ − . d max , k ∗ , bh ∗ , t ∗ ) consist of a tree T B , a path P t ∗ , a set { S s | s ∈ [1 , s ∗ ] } oftrees, a set { T t | t ∈ [1 , t ∗ ] } of trees, and a set of directed edges between T B and P t ∗ so that anacyclic graph H ∈ H ( n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) will be constructed in the following way:(i) The k ∗ -branch-tree of H will be chosen as a subtree of T B = ( V B , E B );(ii) Each k ∗ -fringe-tree rooted at a vertex u s ∈ V ( T B ) of H will be chosen as a subtree of S s ;(iii) Each k -branch-path of H (except for its end-vertices) will be chosen as a subpath of P t ∗ oras an edge in T B ;(iv) Each k ∗ -fringe-tree rooted at a vertex v t ∈ V ( P t ∗ ) of H will be chosen as a subtree of T t ; and(v) An edge ( u, v ) directed from T B to P t ∗ will be selected as an initial edge of a k ∗ -branch-pathof H and an edge ( v, u ) directed from P t ∗ to T B will be selected as an ending edge of a k ∗ -branch-path of H .More formally each component of a scheme graph SG( d max , k ∗ , bh ∗ , t ∗ ) is defined as follows.(i) T B = ( V B = { u , u , . . . , u s ∗ } , E B = { a , a . . . , a c ∗ } ), called a base-tree is a tree rooted ata vertex u that is isomorphic to the rooted tree T ( d max , d max − , bh ∗ ). Regard T B as anordered tree by introducing a total order for each set of siblings and call the first (resp.,last) child in a set of siblings the leftmost (resp. rightmost) child, which defines the leftmost(rightmost) path from the root u to a leaf in T B , as illustrated in Figure 4(a).For each vertex u s ∈ V B , let E B ( s ) denote the set of indices i of edges a ( i ) ∈ E B incident to u s and Cld B ( s ) denote the set of indices i of children u i ∈ V B of u s in the tree T B .For each integer d ∈ [0 , k ∗ ], let V B ( d ) denote the set of indices s of vertices u s ∈ V B whosedepth is d in the tree T B , where V B (bh ∗ ) is the set of indices s of leaves u s of T B .Regard each edge a i ∈ E B as a directed edge ( u s , u s ′ ) from one end-vertex u s of a i to theother end-vertex u s ′ of a i such that s = prt( s ′ ) (i.e., u s is the parent of u s ′ ), where head( i )and tail( i ) denote the head u s ′ and tail u s of edge a i ∈ E B , respectively.For each index s ∈ [1 , s ∗ ], let E + B ( s ) (resp., E − B ( s )) denote the set of indices i of edges a i ∈ E B such that the tail (resp., head) of a i is vertex u s .Let L B denote the set of indices of leaves of T B , and s left (resp., s right ) denote the index s ∈ L B of the leaf u s at which the leftmost (resp., rightmost) path from the root ends.For each leaf u s , s ∈ L B , let V B,s (resp., E B,s ) denote the set of indices s of non-root vertices u s (resp., indices i of edges a ( i ) ∈ E B ) along the path from the root to the leaf u s in the tree T B .For the example of a base-tree T B with bh ∗ = 2 in Figure 4, it holds that L B = { , , , , , } , s left = 5, s right = 10, E B,s left = { , } and V B,s left = { , } .(ii) S s , s ∈ [1 , s ∗ ] is a tree rooted at vertex u s ∈ V B in T B that is isomorphic to the rooted tree T ( d max − , d max − , k ∗ ), as illustrated in Figure 4(b). Let u s,i and e ′ s,i denote the vertex and12dge in S s that correspond to the i -th vertex and the i -th edge in T ( d max − , d max − , k ∗ ),respectively. Regard each edge e ′ s,i as a directed edge ( u s, prt( i ) , u s,i ). For this, each vertex u s ∈ V B is also denoted by u s, .(iii) P t ∗ = ( V P = { v , v , . . . , v t ∗ } , E P = { e , e , . . . , e t ∗ } ), called a link-path with size t ∗ is adirected path from vertex v to vertex v t ∗ , as illustrated in Figure 4(a). Each edge e t ∈ E P is directed from vertex v t − to vertex v t .(iv) T t , t ∈ [1 , t ∗ ] is a tree rooted at vertex v t in P t ∗ that is isomorphic to the rooted tree T ( d max − , d max − , k ∗ ), as illustrated in Figure 4(c). Let v t,i and e t,i denote the vertex andedge in T t that correspond to the i -th vertex and the i -th edge in T ( d max − , d max − , k ∗ ),respectively. Regard each edge e t,i as a directed edge ( v t, prt( i ) , u t,i ). For this, each vertex v t ∈ V P is also denoted by v t, .(v) For every pair ( s, t ) with s ∈ [1 , s ∗ ] and t ∈ [1 , t ∗ ], join vertices u s and v t with directed edges( u s , v t ) and ( v t , u s ), as illustrated in Figure 4(a). v v = v t* ,1 T T = S s* = T t* S S S S T v e v t ,4 v t ,2 v t ,3 e t ,4 e t ,2 e t ,3 v t ,1 u s ,4 u s ,5 u s ,2 u s ,6 e’ s ,4 e’ s ,5 e’ s ,2 e’ s ,6 e’ s ,3 u s ,1 = u s (b) S s =T ( d max - , d max - , k * ) (c) T t =T ( d max - , d max - , k * ) u u u u u a a a a a u u u u a a a a u v v e T T e e u s ,7 u s ,3 e’ s ,7 T B =T ( d max , d max - , bh * ) (a) P t* Figure 4: An illustration of scheme graph SG( d max , k ∗ , bh ∗ , t ∗ ) with d max = 3, k ∗ = 2, bh ∗ = 2,and t ∗ = 5, where the vertices in T B (resp., in P t ∗ ) are depicted with black (resp., gray) circles:(a) A base-tree T B and a link-path P t ∗ are joined with directed edges between them; (b) A tree S s rooted at a vertex u s = u s, ∈ V B ; (c) A tree T t rooted at a vertex v t = v t, ∈ V P .Figure 5(a) illustrates an acyclic graph H with n ( H ) = 37, dia( H ) = 17, bh ( H ) = 2 andbl ( H ) = 3, where the maximum degree of a vertex is 3. Figure 5(b) illustrates the 2-branch-tree13f the acyclic graph H in Figure 5(a). Figure 5(c) illustrates a subgraph H ′ of the scheme graphSG( d max , k ∗ , bh ∗ , t ∗ = n ∗ − bl ∗ −
1) such that H ′ is isomorphic to the acyclic graph H in Figure 5(a). a u u u u u v v v v v v v v v v e e e e e’ e’ e’ e e’ u u u u e’ e’ e’ u u u e’ e’ u u e’ e’ u u e’ u e e e v v v e e e v v v e e v v e e v e v e u u u u u v v v v v v v v v v (a) H (b) u u u u u a a a a (c) H’ Figure 5: An illustration of selecting a subgraph H from the scheme graph SG( d max , k ∗ , bh ∗ , t ∗ = n ∗ − bl ∗ − H ∈ H ( n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) with n ∗ = 37, d max = 3,dia ∗ ( H ) = 17, k ∗ = 2, bh ∗ = 2 and bl ∗ = 3, where the labels of some vertices indicate thecorresponding vertices in the scheme graph SG( d max , k ∗ , bh ∗ , t ∗ ); (b) The k ∗ -branch-tree of H for k ∗ = 2; (c) An acyclic graph H ′ selected from SG( d max , k ∗ , bh ∗ , t ∗ ) as a graph that is isomorphicto H in (a).In this paper, we obtain the following result. Theorem 2.
Let Λ be a set of chemical elements, Γ be a set of adjacency-configurations, where | Λ | ≤ | Γ | , and K = | Λ | + | Γ | + 28 . Given non-negative integers n ∗ ≥ , d max ∈ { , } , dia ∗ ≥ , k ∗ ≥ , bh ∗ ≥ and bl ∗ ≥ , there is an MILP M ( x, g ; C ) that consists of variable vectors x ∈ R K and g ∈ R q for an integer q = O ( | Γ | · [( d max − bh ∗ + k ∗ + n ∗ · ( d max − max { bh ∗ ,k ∗ } )]) and a set C of O ( | Γ | + ( d max − bh ∗ + k ∗ + n ∗ · ( d max − max { bh ∗ ,k ∗ } )) constraints on x and g such that: ( x ∗ , g ∗ ) is feasible to M ( x, g ; C ) if and only if g ∗ forms a chemical acyclic graph G = ( H, α, β ) ∈ G (Λ , Γ) such that H ∈ H ( n ∗ , d max , dia ∗ , k ∗ , bh ∗ , bl ∗ ) and f ( G ) = x ∗ . Note that our MILP requires only O ( n ∗ ) variables and constraints when the branch-parameter k ∗ , the k ∗ -branch height and | Γ | are constant. We formulate an MILP in Theorem 2 so that sucha graph H is selected as a subgraph of the scheme graph.We explain the basic idea of our MILP. The MILP mainly consists of the following three typesof constraints.C1. Constraints for selecting an acyclic graph H as a subgraph of the scheme graph SG( d max , k ∗ , bh ∗ , t ∗ );C2. Constraints for assigning chemical elements to vertices and multiplicity to edges to determinea chemical graph G = ( H, α, β ); andC3. Constraints for computing descriptors from the selected acyclic chemical graph G .14n the constraints of C1, more formally we prepare the following.(i) In the scheme graph SG( d max , k ∗ , bh ∗ , t ∗ ), we prepare a binary variable u ( s,
1) for each vertex u s = u s, ∈ V B , s ∈ [1 , s ∗ ] so that vertex u s = u s, becomes a k ∗ -branch of a selected graph H if and only if u ( s,
1) = 1. The subgraph of the base-tree T B that consists of vertices u s = u s, with u ( s,
1) = 1 will be the k ∗ -branch-tree of the graph H . We also prepare a binary variable a ( i ), i ∈ [1 , c ∗ ] for each edge a i ∈ E B , where c ∗ = s ∗ −
1. For a pair of a vertex u s, anda child u s ′ , of u s, such that u ( s,
1) = u ( s ′ ,
1) = 1, either the edge a i = ( u s, , u s ′ , ) is usedin the selected graph H (when a ( i ) = 1) or a path P i = ( u s, , v t ′ , , v t ′ +1 , , . . . , v t ′′ , , u s ′ , )from vertex u s, to vertex u s ′ , is constructed in H with an edge ( u s, , v t ′ , ), a subpath( v t ′ , , v t ′ +1 , , . . . , v t ′′ , ) of the link-path P t ∗ and an edge ( v t ′′ , , u s ′ , ) (when a ( i ) = 0). Forexample, vertices u , and u , are connected by a path P = ( u , , v , , v , , u , ) in theselected graph H ′ in Figure 5(c).(ii) Let n Stree = 1 + ( d max − d max − k ∗ − / ( d max − ,n Ttree = 1 + ( d max − d max − k ∗ − / ( d max − , where n Stree (resp., n Ttree ) is the numbers of vertices in the rooted tree T ( d max − , d max − , k ∗ )(resp., T ( d max − , d max − , k ∗ )). In each tree S s , s ∈ [1 , s ∗ ] (resp., T t , t ∈ [1 , t ∗ ]) in the schemegraph, we prepare a binary variable u ( s, i ) (resp., v ( t, i )) for each vertex u s,i , i ∈ [2 , n Stree ](resp., v t,i , i ∈ [2 , n Ttree ]) so that u ( s, i ) = 1 (resp., v ( t, i ) = 1) means that the correspondingvertex u s,i (resp., v t,i ) is used as a vertex in a selected graph H . The (non-empty) subgraphof a tree S s (resp., T t ) that consists of vertices u s,i with u ( s, i ) = 1 (resp., v t,i with v ( t, i ) = 1)will be a k ∗ -fringe-tree of a selected graph H .(iii) In the link-path P t ∗ , we prepare a binary variable e ( t ), t ∈ [2 , t ∗ ] for each edge e t, =( v t − , , v t, ) ∈ E P so that e ( t ) = 1 if and only if edge e t, is used in some path P i =( u s, , v t ′ , , v t ′ +1 , , . . . , v t ′′ , , u s ′ , ) constructed in (i).(iv) For each pair ( s, t ) of s ∈ [1 , s ∗ ] and t ∈ [1 , t ∗ ], we prepare a binary variable e ( s, t ) (resp., e ( t, s )) so that e ( s, t ′ ) = 1 (resp., e ( t ′′ , s ) = 1) if and only if directed edge ( u s, , v t ′ , ) (resp.,( v t ′′ , , u s, )) is used as the first edge (resp., last edge) of some path P i = ( u s, , v t ′ , , v t ′ +1 , , . . . ,v t ′′ , , u s ′ , ) constructed in (i).Based on these, we include constraints with some more additional variables so that a selectedsubgraph H is a connected acyclic graph. See constraints (13) to (33) in Appendix B for thedetails.In the constraints of C2, we prepare an integer variable e α ( u ) for each vertex u in the schemegraph that represents the chemical element α ( u ) ∈ Λ if u is in a selected graph H (or e α ( u ) = 0otherwise) and an integer variable e β ( e ) ∈ [0 ,
3] (resp., b β ( e ) ∈ [0 , e (resp., e = e ( s, t )or e ( t, s ), s ∈ [1 , s ∗ ], t ∈ [1 , t ∗ ]) in the scheme graph that represents the multiplicity β ( e ) ∈ [1 ,
3] if e is in a selected graph H (or e β ( e ) or b β ( e ) takes 0 otherwise). This determines a chemical graph G = ( H, α, β ). Also we include constraints for a selected chemical graph G to satisfy the valence15ondition ( α ( u ) , α ( v ) , β ( uv )) ∈ Γ for each edge uv ∈ E . See constraints (34) to (48) in Appendix Bfor the details.In the constraints of C3, we introduce a variable for each descriptor and constraints with somemore variables to compute the value of each descriptor in f ( G ) for a selected chemical graph G .See constraints (49) to (77) in Appendix B for the details. The algorithm used in Stage 5 in the previous methods of inferring chemical acyclic graphs [3, 5, 24]are all based on the branch-and-bound algorithm proposed by Fujiwara et al. [6] where an enormousnumber of chemical graphs are constructed by repeatedly appending and removing a vertex oneby one until a target chemical graph is constructed. Their algorithm cannot generate even oneacyclic chemical graph when n ( G ) is larger than around 20.This section designs a new dynamic programming method for designing an algorithm in Stage 5.We consider the following aspects:(a) Treat acyclic graphs with a certain limited structure that frequently appears among chemicalcompounds registered in the chemical data base; and(b) Instead of manipulating acyclic graphs directly, first compute the frequency vectors fff ( G ′ )(some types of feature vectors) of subtrees G ′ of all target acyclic graphs and then constructa limited number of target graphs G from the process of computing the vectors.In (a), we choose a branch-parameter k ∗ = 2 and treat acyclic graphs G that have a small2-branch number such as bl ( G ) ∈ [2 , G with bl ( G ) = 2 and bl ( G ) = 3, respectively.We design a method in (b) based on the mechanism of dynamic programming wherein the firstphase computes some compressed forms of all substructures of target objects before the secondphase realizes a final object based on the computation process of the first phase.Section 5.1 defines a frequency vector fff ( G ) that represents a feature vector f ( G ) of a chemicalgraph G . Section 5.2 presents the idea and a sketch of our new algorithms for generating acyclicgraphs G with bl ( G ) ∈ [2 , For a finite set A of elements, let Z A + denote the set of functions : A → Z + . A function ∈ Z A + is called a non-negative integer vector (or a vector) on A and the value xxx ( a ) for an element a ∈ A is called the entry of xxx for a ∈ A . For a vector ∈ Z A + and an element a ∈ A , let + 111 a (resp., − a ) denote the vector ′ such that ′ ( a ) = ( a ) + 1 (resp., ′ ( a ) = ( a ) −
1) and ′ ( b ) = ( b )for the other elements b ∈ A \ { a } . For a vector ∈ Z A + and a subset B ⊆ A , let [ B ] denote the projection of to B ; i.e., [ B ] ∈ Z B + such that [ B ] ( b ) = ( b ), b ∈ B .Let Bc denote the set of tuples µ = ( d , d , k ) ∈ [1 , × [1 , × [1 ,
3] (bond-configuration)such that max { d , d } + k ≤
4. We regard that ( d , d , k ) = ( d , d , k ). For two tuples µ =( d , d , k ) , µ ′ = ( d ′ , d ′ , k ′ ) ∈ Bc, we write µ ≥ µ ′ if max { d , d } ≥ max { d ′ , d ′ } , min { d , d } ≥ a) b l ( G )=2 length = dia* -4 height = 2 v length = dia* -4 height = 2 v v v v v (b) bl ( G )=3 v’ v’ v’ T T T T Figure 6: An illustration of chemical acyclic graphs G with diameter dia ∗ and bl ( G ) = 2 ,
3: (a) Achemical acyclic graph G with two leaf 2-branches v and v ; (b) A chemical acyclic graph G withthree leaf 2-branches v , v and v .min { d ′ , d ′ } and k ≥ k ′ , and write µ > µ ′ if µ ≥ µ ′ and µ = µ ′ . Let Dg = { dg1 , dg2 , dg3 , dg4 } ,where dg i denotes the number of vertices with degree i .Henceforth we deal with vectors that have their in and ex components, both in ex ∈ Z Λ ∪ Γ ∪ Bc ∪ Dg+ , and for convenience we write = ( in ex ) in the sense of concatenation.For a vector xxx = ( xxx in , xxx ex ) with xxx in , xxx ex ∈ Z Λ ∪ Γ ∪ Bc ∪ Dg+ , let G ( xxx ) denote the set of chemical acyclicgraphs G that satisfy the following:ce in a ( G ) = xxx in ( a ) and ce ex a ( G ) = xxx ex ( a ) for each chemical element a ∈ Λ,ac in γ ( G ) = xxx in ( γ ) and ac ex γ ( G ) = xxx ex ( γ ) for each adjacency-configuration γ ∈ Γ,bc in µ ( G ) = xxx in ( µ ) and bc ex µ ( G ) = xxx ex ( µ ) for each bond-configuration µ ∈ Bc,dg in i ( G ) = xxx in (dg i ) and dg ex i ( G ) = xxx ex (dg i ) for each degree dg i ∈ Dg.Throughout the section, let k ∗ = 2 be a branch-parameter, xxx ∗ = ( xxx ∗ in , xxx ∗ ex ) be a given featurevector with xxx ∗ in , xxx ∗ ex ∈ Z Λ ∪ Γ ∪ Bc ∪ Dg+ , and dia ∗ be an integer. We infer a chemical acyclic graph G ∈ G ( xxx ∗ ) such that bl ( G ) ∈ [2 ,
3] and the diameter of G is dia ∗ , where n ∗ = P a ∈ Λ ( xxx ∗ in ( a )+ xxx ∗ ex ( a )).Note that any other descriptors of G ∈ G ( xxx ∗ ) can be determined by the entries of vector xxx ∗ .To infer a chemical acyclic graph G ∈ G ( xxx ∗ ), we consider a connected subgraph T of G thatconsists of - a subtree of the 2-branch-subtree G ′ of G and- the 2-fringe-trees rooted at vertices in G ′ . (2)Our method first generates a set F T of all possible rooted trees T that can be a 2-fringe-treeof a chemical graph G ∈ G ( xxx ∗ ), and then extends the trees T by repeatedly appending a tree in F T until a chemical graph G ∈ G ( xxx ∗ ) is formed. In the extension, we actually manipulate the“frequency vectors” of trees defined below.To specify which part of a given tree T plays a role of 2-internal vertices/edges or 2-externalvertices/edges in a chemical graph G ∈ G ( xxx ∗ ) to be inferred, we designate at most three vertices r ( T ), r ( T ) and r ( T ) in T as terminals , and call T rooted (resp., bi-rooted and tri-rooted ) if thenumber of terminals is one (resp., two and three). For a rooted tree (resp., bi- or tri-rooted tree) T , let e V in denote the set of vertices contained in a path between two terminals of T , e E in denote theset of edges in T between two vertices in e V in , and define e V ex , V ( T ) \ e V in and e E ex , E ( T ) \ e E in .17or a bi- or tri-rooted tree T , define the backbone path P T of T to be the path of T between vertices r ( T ) and r ( T ).Given a chemical acyclic graph T , define fff t ( T ), t ∈ { in , ex } to be the vector ∈ Z Λ ∪ Γ ∪ Bc ∪ Dg+ that consists of the following entries:- ( a ) = |{ v ∈ e V t | α ( v ) = a }| , a ∈ Λ,- ( γ ) = |{ uv ∈ e E t | { α ( u ) , α ( v ) } = { a , b } , β ( uv ) = q }| , γ = ( a , b , q ) ∈ Γ,- ( µ ) = |{ uv ∈ e E t | { deg T ( u ) , deg T ( v ) } = { d, d ′ } , β ( uv ) = m }| , µ = ( d, d ′ , m ) ∈ Bc,- (dg i ) = |{ v ∈ e V t | deg T ( v ) = i }| , dg i ∈ Dg.Define fff ( T ) , ( fff in ( T ) , fff ex ( T )). The entry for an element e ∈ Λ ∪ Γ ∪ Bc ∪ Dg in fff t ( T ), t ∈{ in , ex } is denoted by fff t ( e ; T ). For a subset B of Λ ∪ Γ ∪ Bc ∪ Dg, let fff t [ B ] ( T ) denote the projec-tion of fff t ( T ) to B .Our aim is to generate all chemical bi-rooted (resp., tri-rooted) trees T with diameter dia ∗ suchthat fff ( T ) = xxx ∗ . This section describes the idea and a sketch of our new graph search algorithms. bl ( G ) = 2We call a chemical graph G ∈ G ( x ∗ ) with diameter dia ∗ and bl ( G ) = 2 a target graph .A chemical acyclic graph G with bl ( G ) = 2 has exactly two leaf 2-branches v i , i = 1 , v and v of a target graph G isdia ∗ − k ∗ = dia ∗ −
4. We observe that a connected subgraph T of a target graph G that satisfies(2) for bl ( G ) = 2 is a chemical rooted or bi-rooted tree. We call such a subgraph T an internal-subtree (resp., end-subtree ) of G if neither (resp., one) of u and v is a 2-branch in G . When u = v ,we call an internal-subtree (resp., end-subtree) T of G an internal-fringe-tree (resp., end-fringe-tree ) of G . Figure 7(a)-(d) illustrate an internal-subtree, an internal-fringe-tree, an end-subtreeand an end-fringe-tree of G .Let δ = ⌊ dia ∗ − ⌋ and δ = dia ∗ − − δ = ⌈ dia ∗ − ⌉ . We regard a target graph G ∈ G ( x ∗ ) withbl ( G ) = 2 and diameter dia ∗ as a combination of two chemical bi-rooted trees T and T with ℓ ( P T i ) = δ i , i = 1 , e = r ( T ) r ( T ), as illustrated in Figure 8.We start with generating chemical rooted trees and then iteratively extend chemical bi-rootedtrees T with ℓ ( P T ) = 1 , , . . . , δ before we finally combine two chemical bi-rooted trees T and T with ℓ ( P T i ) = δ i . To describe our algorithm, we introduce some notations.- Let T ( x ∗ ) denote the set of all bi-rooted trees T (where possibly r ( T ) = r ( T )) such that fff in ( T ) ≤ xxx ∗ in and fff ex ( T ) ≤ xxx ∗ ex , which is a necessary condition for T to be an internal-subtreeor end-subtree of a target graph G ∈ G ( x ∗ ).- Let F T denote the set of all rooted trees T ∈ T ( x ∗ ) that can be a 2-fringe-tree of a targetgraph G , where T satisfies the size constraint (1) of 2-fringe-trees.18 a) an internal-subtree T of G v v v uPuv (d) an end-fringe-tree T of G v v u (c) an end-subtree T of G v v v uPuv (b) an internal-fringe-tree T of G v v v=u u Figure 7: An illustration of subtrees T of a chemical acyclic graph G in Figure 6(a), where thevertices/edges in T are depicted by solid lines: (a) An internal-subtree T of G ; (b) An internal-fringe-tree T of G ; (c) An end-subtree T of G ; (d) An end-fringe-tree T of G . length = d length = d dia*-52=dia*-52= a a d d m m r ( T ) r ( T ) r ( T ) r ( T ) mT [ + T [ + Figure 8: An illustration of combining two bi-rooted trees T = T and T = T with a new edgewith multiplicity m joining vertices r ( T ) and r ( T ) to construct a target graph G , where a i ∈ Λ, d i ∈ [1 , d max − m i ∈ [ d i , val( a i ) − i = 1 , m ∈ [1 , min { , val( a ) − m , val( a ) − m } ].- For each integer h ∈ [1 , dia ∗ − T ( h )end denote the set of all bi-rooted trees T ∈ T ( x ∗ ) thatcan be an end-subtree of a target graph G such that ℓ ( P T ) = h , and each 2-fringe-tree T v rooted at a vertex v in P T belongs to F T .We remark that the size |T ( h )end | of trees will be enormously large for n ∗ ≥
25 and dia ∗ ≥ G by enumerating trees in T ( h )end directly neverworks for such a large size of instances. The idea of our new algorithm is to compute only the setW ( h )end of frequency vectors of these trees, whose size | W ( h )end | is much more restricted than that of T ( h )end . We compute the set W ( h ) of frequency vectors of trees in T ( h )end iteratively for each integer h ≥
0. During the computation, we keep a sample of a tree T for each of such frequency vectors so that a final step can construct some number of target graphs G by assembling these sampletrees. Based on this, we generate target graphs G ∈ G ( x ∗ ) by the following steps:1. (i) Compute F T by a branch-and-bound procedure that generates all possible rooted trees19 ∈ T ( xxx ∗ ) (where r ( T ) = r ( T )) that can be a 2-fringe-tree of a target graph G ∈ G ( x ∗ );(ii) Compute the set W (0) of all vectors = ( in ex ) such that in = fff in ( T ) and ex = fff ex ( T ) for some tree T ∈ F T ;(iii) For each vector = ( in ex ) ∈ W (0) , choose a sample tree T ∈ F T such that in = fff in ( T ) and ex = fff ex ( T ), and store these sample trees;2. For each integer h = 1 , , . . . , δ , iteratively execute the next:(i) Compute the set W ( h )end of all vectors = ( in ex ) such that in = fff in ( T ) and ex = fff ex ( T ) for some bi-rooted tree T ∈ T ( h )end , where such a vector is obtained from a combinationof vectors ′ ∈ W (0) and ′′ ∈ W ( h − ;(ii) For each vector ∈ W ( h )end , store a sample tree T , which is obtained from a combinationof sample trees T ′ with ′ ∈ W (0) and T ′′ with ′′ ∈ W ( h − ;3. We call a pair of vectors ∈ W ( δ )end and ∈ W ( δ )end feasible if it admits a target graph G ∈ G ( xxx ∗ ) such that + ≤ xxx ∗ in and + ≤ xxx ∗ ex . Find the set W pair of all feasiblepairs of vectors and ;4. For each feasible vector pair ( ) ∈ W pair , construct a corresponding target graph G bycombining the corresponding samples trees T and T , as illustrated in Figure 8.For a relatively large instance with n ∗ ≥
40 and dia ∗ ≥
20, the number | W pair | of feasible vectorpairs in Step 4 is still very large. In fact, the size | W ( h )end | of a vector set W ( h )end to be computed inStep 2 can also be considerably large during an execution of the algorithm. For such a case, weimpose a time limitation on the running time for computing W ( h )end and a memory limitation onthe number of vectors stored in a vector set W ( h )end . With these limitations, we can compute only alimited subset c W ( h )end of each vector set W ( h )end in Step 2. Even with such a subset c W ( h )end , we still canfind a large size of a subset c W pair of W pair in Step 3.Our algorithm also delivers a lower bound on the number of all target graphs G ∈ G ( x ∗ ) in thefollowing way. In Step 1, we also compute the number t ( ) of trees T ∈ F T such that = fff ( T )for each ∈ W (0) . In Step 2, when a vector is constructed from two vectors ′ and ′′ , weiteratively compute the number t ( ) of trees T such that = fff ( T ) by t ( ) := t ( ′ ) × t ( ′′ ). InStep 3, when a feasible vector pair ( ) ∈ W pair is obtained, we know that the number of thecorresponding target graphs G is t ( ) × t ( ). Possibly we compute a subset c W pair of W pair inStep 3. Then (1 / P ( ) ∈ c W pair t ( ) × t ( ) gives a lower bound on the number of target graphs G ∈ G ( x ∗ ), where we divided by 2 since an axially symmetric target graph G can correspond totwo vector pairs in W pair .Detailed descriptions of the five steps in the above algorithm can be found in Appendix C. bl ( G ) = 3We call a chemical graph G ∈ G ( x ∗ ) with diameter dia ∗ and bl ( G ) = 3 a target graph . Let n ∗ inl , P a ∈ Λ xxx ∗ in ( a ), which is the number of 2-internal vertices in a target graph G ∈ G ( xxx ∗ ).A chemical acyclic graph G with bl ( G ) = 3 has exactly three leaf 2-branches v i , i = 1 , v adjacent to three 2-internal vertices v i , i = 1 , ,
3, as illustrated inFigure 6(b). We call vertex v the joint-vertex of G . Without loss of generality assume that the20ength of the path P v ,v between v and v is dia ∗ − P v ,v ′ is notsmaller than that of P v ,v ′ .Analogously with the case of bl ( G ) = 2, we define internal-subtree (resp., end-subtree , internal-fringe-tree and end-fringe-tree ) of G to be a connected subgraph G ′ that satisfies (2). Observe that G can be partitioned into three end-subtrees T i , i = 1 , ,
3, the 2-fringe-tree T rooted at the joint-vertex v and three edges v i v , i = 1 , ,
3, where the backbone path P T i connects leaf 2-branch v i and vertex v ′ i . In particular, we call the end-subtree of G that consists of T , T , T and edges v i v , i = 1 , main-subtree of G , which consists of the path P v ,v and all the 2-fringe-trees rootedat vertices in P v ,v . We call T the co-subtree of G .Let δ i , i = 1 , , T i . Note that δ + δ + 2 = dia ∗ − δ ≥ δ ≥ δ = n ∗ inl − dia ∗ + 2 , from which δ ∈ [ δ , ⌊ dia ∗ / ⌋ −
3] and δ ∈ [ ⌈ dia ∗ / ⌉ − , dia ∗ − − δ ] . We regard a target graph G ∈ G ( x ∗ ) with bl ( G ) = 3 and diameter dia ∗ as a combinationof the main-subtree and the co-subtree joined with an edge. We represent the co-subtree as achemical bi-rooted tree T with ℓ ( P T ) = δ . We represent the main-subtree of a target graph G asa tri-rooted tree T with ℓ ( P T ) = dia − r ( T ), r ( T ) and r ( T ) correspond tothe two leaf 2-branches and the joint-vertex of G , respectively. length = dia* -4 length = d a a d m r ( T ) r ( T ) r ( T ) r ( T ) m < > T [ + r ( T ) T + length = d +1 d m Figure 9: An illustration of combining a tri-rooted T = T and a bi-rooted tree T = T with anew edge joining vertices r ( T ) and r ( T ) to construct a target graph G .We start with generating chemical rooted trees and then iteratively extend chemical bi-rootedtrees T with ℓ ( P T ) = 1 , , . . . , dia ∗ − − δ before we combine two chemical bi-rooted trees T ′ and T ′ to obtain a chemical tri-rooted tree T with ℓ ( P T ) = δ i and finally combine a chemicaltri-rooted tree T and a chemical bi-rooted trees T with ℓ ( P T ) = δ , to obtain a target graph G ∈ G ( x ∗ ).Analogously with the case of bl ( G ) = 2, we define the set T ( x ∗ ) of all bi-rooted trees T , theset F T of all rooted trees T ∈ T ( x ∗ ) that can be a 2-fringe-tree of a target graph G and the set21 ( h )end , h ∈ [1 , dia ∗ − − δ ]) of all bi-rooted trees T ∈ T ( x ∗ ) that can be an end-subtree of a targetgraph G such that ℓ ( P T ) = h .We generate target graphs G ∈ G ( x ∗ ) by the following steps:1. Analogously with Step 1 for the case of bl ( G ) = 2, compute the set F T and the set W (0) ofall vectors = ( in ex ) such that in = fff in ( T ) and ex = fff ex ( T ) for some tree T ∈ F T .For each vector ∈ W (0) , store a sample tree T ∈ F T ;2. For each integer h = 1 , , . . . , dia ∗ − − δ , compute the set W ( h )end of all vectors = ( in ex )such that in = fff in ( T ) and ex = fff ex ( T ) for some bi-rooted tree T ∈ T ( h )end ; For each vector ∈ W ( h )end , store a sample tree T ;3. For each integer h ∈ [ ⌈ dia ∗ / ⌉ − , dia ∗ − − δ ], compute the set W ( h )end+2 of all vectors =( in ex ) such that in = fff in ( T ) and ex = fff ex ( T ) of some bi-rooted tree T with ℓ ( P T ) = h that represents an end-subtree rooted at the joint-vertex; For each vector ∈ W ( h )end+2 , storea sample tree T ;4. For each integer δ ∈ [ ⌈ dia ∗ / ⌉ − , dia ∗ − − δ ], compute the set W ( δ +1)main of all vectors = ( in ex ) such that in = fff in ( T ) and ex = fff ex ( T ) for some tri-rooted tree T thatrepresents the main-subtree such that the length of the path P r ( T ) ,r ( T ) between terminals r ( T ) and r ( T ) is δ + 1. For each vector ∈ W ( δ +1)main , store a sample tree T ;5. We call a pair of vectors ∈ W ( δ +1)main and ∈ W ( δ )end feasible if it admits a target graph G ∈ G ( xxx ∗ ) such that + ≤ xxx ∗ in and + ≤ xxx ∗ ex . Find the set W pair of all feasiblepairs of vectors and ;6. For each feasible vector pair ( ) ∈ W pair , construct a corresponding target graph G bycombining the samples trees T and T , which correspond to the main-subtree and theco-subtree of a target graph G , respectively, as illustrated in Figure 9.Detailed descriptions of the six steps in the above algorithm can be found in Appendix C. We implemented our method of Stages 1 to 5 for inferring chemical acyclic graphs and conductedexperiments to evaluate the computational efficiency for three chemical properties π : octanol/waterpartition coefficient ( K ow ), boiling point ( Bp ) and heat of combustion ( Hc ). We executed theexperiments on a PC with Two Intel Xeon CPUs E5-1660 v3 @3.00GHz, 32 GB of RAM runningunder OS: Ubuntu 14.04.6 LTS. We show 2D drawings of some of the inferred chemical graphs,where ChemDoodle version 10.2.0 is used for constructing the drawings. Results on Phase 1.
We implemented Stages 1, 2 and 3 in Phase 1 as follows.
Stage 1.
We set a graph class G to be the set of all chemical acyclic graphs, and set a branch-parameter k ∗ to be 2. For each property π ∈ { K ow , Bp, Hc } , we first select a set Λ of chemicalelements and then collected a data set D π on chemical acyclic graphs over the set Λ of chemical22able 1: Results of Stage 1 in Phase 1. π Λ | D π | | Γ | [ n, n ] [bl , bl] [bh , bh] [ a, a ] K ow C,O,N
216 10 [4, 28] [0, 2] [0, 4] [-4.2, 8.23] Bp C,O,N
172 10 [4, 26] [0, 1] [0, 3] [-11.7, 404.84] Hc C,O,N
128 6 [4, 26] [0, 1] [0, 2] [1346.4, 13304.5]elements provided by HSDB from PubChem. To construct the data set, we eliminated chemicalcompounds that have at most three carbon atoms or contain a charged element such as N + or anelement a ∈ Λ whose valence is different from our setting of valence function val.Table 1 shows the size and range of data sets that we prepared for each chemical property inStage 1, where we denote the following:- π : one of the chemical properties K ow , Bp and Hc ;- Λ: the set of selected chemical elements (hydrogen atoms are added at the final stage);- | D π | : the size of data set D π over Λ for property π ;- | Γ | : the number of different adjacency-configurations over the compounds in D π ;- [ n, n ]: the minimum and maximum number n ( G ) of non-hydrogen atoms over the compounds G in D π ;- [bl , bl]: the minimum and maximum numbers bl ( G ) of leaf 2-branches over the compounds G in D π ;- [bh , bh]: the minimum and maximum values of the 2-branch height bh ( G ) over the com-pounds G in D π ; and- [ a, a ]: the minimum and maximum values of a ( G ) in π over compounds G in D π . Stage 2.
We used a feature function f that consists of the descriptors defined in Section 2.Table 2: Results of Stages 2 and 3 in Phase 1. π K Activation Architecture L-Time test R (ave.) test R (best) K ow
76 ReLU (76,10,1) 2.12 0.901 0.951 Bp
76 ReLU (76,10,1) 26.07 0.935 0.965 Hc
68 ReLU (68,10,1) 234.06 0.924 0.98823 tage 3.
We used scikit-learn version 0.21.6 with Python 3.7.4 to construct ANNs N wherethe tool and activation function are set to be MLPRegressor and ReLU, respectively. We testedseveral different architectures of ANNs for each chemical property. To evaluate the performance ofthe resulting prediction function ψ N with cross-validation, we partition a given data set D π into fivesubsets D ( i ) π , i ∈ [1 ,
5] randomly, where D π \ D ( i ) π is used for a training set and D ( i ) π is used for a testset in five trials i ∈ [1 , { y , y , . . . , y N } of observed values and a set { ψ , ψ , . . . , ψ N } of predicted values, we define the coefficient of determination to be R , − P j ∈ [1 ,N ] ( y j − ψ j ) P j ∈ [1 ,N ] ( y j − y ) , where y = N P j ∈ [1 ,N ] y j . Table 2 shows the results on Stages 2 and 3, where- K : the number of descriptors for the chemical compounds in data set D π for property π ;- Activation: the choice of activation function;- Architecture: ( a, b,
1) consists of an input layer with a nodes, a hidden layer with b nodesand an output layer with a single node, where a is equal to the number K of descriptors;- L-time: the average time (sec) to construct ANNs for each trial;- test R (ave): the average of coefficient of determination over the five tests; and- test R (best): the largest value of coefficient of determination over the five test sets.From Table 2, we see that the execution of Stage 3 was successful, where the average of testR is over 0.9 for all three chemical properties.For each chemical property π , we selected the ANN N that attained the best test R scoreamong the five ANNs to formulate an MILP M ( x, y, z ; C ) which will be used in Phase 2. Results on Phase 2.
We implemented Stages 4 and 5 in Phase 2 as follows.
Stage 4.
In this step, we solve the MILP M ( x, y, g ; C , C ) formulated based on the ANN N obtained in Phase 1. To solve an MILP in Stage 4, we use CPLEX version 12.8. In our experiment,we choose a target value y ∗ ∈ [ a, a ]. and fix or bound some descriptors in our feature vector asfollows:- Set the 2-leaf-branch number bl ∗ to be each of 2 and 3;- Fix the instance size n ∗ = n ( G ) to be each integer in { , , , , } ;- Set the diameter dia ∗ = dia( G ) be one of the integers in {⌈ (2 / n ∗ ⌉ , ⌈ (3 / n ∗ ⌉} .- Set the maximum degree d max := 3 for dia ∗ = ⌈ (2 / n ∗ ⌉ and d max := 4 for dia ∗ = ⌈ (3 / n ∗ ⌉ ;- For each instance size n ∗ , test a target value y ∗ π for each chemical property π ∈ { K ow , Bp,Hc } .Based on the above setting, we generated six instances for each instance size n ∗ . We set ε = 0 . ∗ = 2 (resp., bl ∗ = 3),where we denote the following: 24 y ∗ π : a target value in [ a, a ] for a property π ;- n ∗ : a specified number of vertices in [ n, n ];- dia ∗ : a specified diameter in {⌈ (2 / n ∗ ⌉ , ⌈ (3 / n ∗ ⌉} ;- IP-time: the time (sec.) to an MILP instance to find vectors x ∗ and g ∗ .Observe that most of the MILP instances with bl ∗ = 2, n ∗ ≤
50 and dia ∗ ≤
30 (resp., bl ∗ = 3, n ∗ ≤
50 and dia ∗ ≤
30) in one minute (resp., in a few minutes). The previously most efficient MILPformulation for inferring chemical acyclic graphs due to Zhang et al. [24] could solve an instancewith only up to n ∗ = 20 for the case of d max = 4 and dia ∗ = 9. Our new MILP formulation onchemical acyclic graphs with bounded 2-branch height considerably improved the tractable size ofchemical acyclic graphs in Stage 4 for the inference problem (II-a).Figure 10(a)-(c) illustrate some chemical acyclic graphs G with bl ( G ) = 2 obtained in Stage 4by solving an MILP. Remember that these chemical graphs obey the AD D defined in Appendix A. (a) (b) (c) Figure 10: An illustration of chemical acyclic graphs G with n ( G ) = 50, bl ( G ) = 2 and d max = 4obtained in Stage 4 by solving an MILP: (a) y ∗ Kow = 9, dia( G ) = ⌈ (2 / n ∗ ⌉ = 20; (b) y ∗ Bp = 880,dia( G ) = n ∗ / y ∗ Hc = 25000, dia( G ) = ⌈ (3 / n ∗ ⌉ = 30.Figure 11(a)-(c) illustrate some chemical acyclic graphs G with bl ( G ) = 3 obtained in Stage 4by solving an MILP. Stage 5.
In this stage, we execute our new graph search algorithms for generating target graphs G ∈ G ( xxx ∗ ) with bl ( G ) ∈ { , } for a given feature vector xxx ∗ obtained in Stage 4.We introduce a time limit of 10 minute for each iteration h in Step 2 and an execution of Steps 1and 3 for bl ∗ = 2 (resp., each iteration h in Steps 2 and 3 and δ in Step 4 and an execution ofSteps 1 and 5 for bl ∗ = 3). In the last step, we choose at most 100 feasible vector pairs andgenerate a target graph from each of these feasible vector pairs. We also impose an upper boundUB on the size | W | of a vector set W that we maintain during an execution of the algorithm. Weexecuted the algorithm for each of the three bounds UB = 10 , , until a feasible vector pairis found or the running time exceeds a global time limitation of two hours.25 a) (b) (c) Figure 11: An illustration of chemical acyclic graphs G with n ( G ) = 50, bl ( G ) = 3 and d max = 4obtained in Stage 4 by solving an MILP: (a) y ∗ Kow = 9, dia( G ) = ⌈ (2 / n ∗ ⌉ = 20; (b) y ∗ Bp = 880,dia( G ) = n ∗ / y ∗ Hc = 25000, dia( G ) = ⌈ (3 / n ∗ ⌉ = 30.When no feasible vector pair is found by the graph search algorithms, we output the targetgraph G ∗ constructed from the vector g ∗ in Stage 4.Tables 3 to 4 (resp., Tables 5 to 6) show the results on Stage 5 for bl ∗ = 2 (resp., bl ∗ = 3),where we denote the following:- xxx ∗ ;- G-LB: a lower bound on the number of all target graphs G ∈ G ( xxx ∗ ) for a given feature vector xxx ∗ ;- G such that f ( G ) = x ∗ (whereat least one such graph G has been found from the vector g ∗ in Stage 4);- G-time: the running time (sec.) to execute Stage 5 for a given feature vector xxx ∗ . “ > n ∗ up to 16 was solved in Stage 5by Azam et al. [3]. For the classes of chemical graphs with cycle index 1 and 2, the maximumsize of instances solved in Stage 5 by Ito et al. [3] and Zhu et al. [25] was around 18 and 15,respectively. Our new algorithm based on dynamic programming solve instances with n ∗ = 50. Inour experiments, we also computed a lower bound G-LP on the number of target graphs. Observethat there are over 10 or 10 target graphs in some cases. Remember that these lower boundsare computed without actually generating each target graph one by one. So when a lower boundis enormously large, this would suggest that we may need to impose some more constraints on thestructure of graphs or the range of descriptors to narrower a family of target graphs to be inferred.26able 3: Results of Stages 4 and 5 for bl ∗ = 2, d max = 3 and dia ∗ = ⌈ n ∗ ⌉ . π y ∗ n ∗ dia ∗ IP-time K ow . ×
100 0.915 32 13 4.81 216 2 . ×
100 10.647 38 16 7.27 19,931 4 . ×
100 48.298 44 18 9.33 241,956 1 . ×
100 119.019 50 20 21.57 58,365 1 . ×
100 110.38 Bp
440 26 11 2.09 22,342 3 . ×
100 2.9550 32 13 3.94 748 5 . ×
100 3.77660 38 16 6.4 39,228 7 . ×
100 151.25770 44 18 7.21 138,076 3 . ×
100 182.66880 50 20 9.49 106,394 3 . ×
100 217.18 Hc . ×
12 0.0416500 32 13 7.67 2,722 1 . ×
100 0.3120000 38 16 10.5 1,830 9 . ×
100 1.0623000 44 18 13.62 12,336 4 . ×
100 142.0225000 50 20 15.1 136,702 5 . ×
100 22.26
An Additional Experiment.
We also conducted some additional experiment to demonstratethat our MILP-based method is flexible to control conditions on inference of chemical graphs. InStage 3, we constructed an ANN N π for each of the three chemical properties π ∈ { K ow , Bp , Hc } ,and formulated the inverse problem of each ANN N π as an MILP M π . Since the set of descriptorsis common to all three properties K ow , Bp and Hc , it is possible to infer a chemical acyclic graph G that satisfies a target value y ∗ π for each of the three properties at the same time (if one exists).We specify the size of graph so that n ∗ = 50, bl ∗ = 2, dia ∗ = 25 and d max = 4, and set targetvalues with y ∗ Kow = 4 . y ∗ Bp = 400 . y ∗ Hc = 13000 . M Kow , M Hc and M Bp . The MILP was solved in 18930 (sec) and we obtained a chemical acyclicgraph G illustrated in Figure 12. We continued to execute Stage 5 for this instance to generatemore target graphs G ∗ . Table 7 shows that 100 target graphs are generated by our new dynamicprogramming algorithm. 27able 4: Results of Stages 4 and 5 for bl ∗ = 2, d max = 4 and dia ∗ = ⌈ n ∗ ⌉ . π y ∗ n ∗ dia ∗ IP-time K ow . ×
100 1.185 32 20 24.74 1,650 5 . ×
100 0.697 38 23 38.88 154,408 9 . ×
100 67.318 44 27 38.73 1,122,126 8 . ×
100 660.379 50 30 31.59 690,814 1 . ×
100 238.02 Bp
440 26 16 12.44 8,156 2 . ×
100 2.74550 32 20 23.22 38,600 4 . ×
100 12.72660 38 23 20.62 52,406 1 . ×
100 197.89770 44 27 50.55 23,638 6 . ×
100 244.56880 50 30 48.37 40,382 2 . ×
100 884.99 Hc . ×
100 0.0616500 32 20 44.2 448 6 . ×
100 0.6320000 38 23 96.02 3,330 6 . ×
100 15.1623000 44 27 82.34 43,686 1 . ×
100 152.9625000 50 30 83.81 311,166 1 . ×
100 287.95Figure 12: An illustration of a chemical acyclic graph G inferred for three chemical properties K ow , Bp and Hc simultaneously, where y ∗ Kow = 4 . y ∗ Bp = 400 . y ∗ Hc = 13000 . n ∗ = 50, bl ∗ = 2,dia ∗ = 25 and d max = 4. 28able 5: Results of Stages 4 and 5 for bl ∗ = 3, d max = 3 and dia ∗ = ⌈ n ∗ ⌉ . π y ∗ n ∗ dia ∗ IP-time K ow . ×
100 14.315 32 13 4.72 3,510 6 . ×
100 851.217 38 16 5.82 11,648 1 . ×
100 612.868 44 18 9.69 17,239 2 . ×
100 703.929 50 20 22.53 60,792 3 . ×
100 762.17 Bp
440 26 11 3.01 66 9 . ×
66 902.77550 32 13 4.29 308 1 . ×
100 2238.62660 38 16 5.86 303 1 . ×
100 3061.11770 44 18 14.39 19,952 4 . ×
100 678.26880 50 20 10.39 17,993 7 . ×
100 4151.07 Hc . ×
100 1.5716500 32 13 5.81 600 3 . ×
100 921.5520000 38 16 15.67 18,502 6 . ×
100 1212.5423000 44 18 21.15 5,064 6 . ×
100 1279.9525000 50 20 31.90 41,291 2 . ×
100 668.5Table 6: Results of Stages 4 and 5 for bl ∗ = 3, d max = 4 and dia ∗ = ⌈ n ∗ ⌉ . π y ∗ n ∗ dia ∗ IP-time K ow . ×
100 6.735 32 20 16.58 348 1 . ×
100 3400.747 38 23 33.71 17,557 1 . ×
100 2652.388 44 27 34.28 0 0 1 > . ×
100 6423.85 Bp
440 26 16 14.16 150 1 . ×
100 29.72550 32 20 18.94 305 1 . ×
100 2641.9660 38 23 21.15 1,155 2 . ×
100 4521.66770 44 27 25.6 1,620 4 . ×
100 175.2880 50 30 63.22 0 0 1 > Hc . ×
12 0.6616500 32 20 41.03 392 3 . ×
100 2480.3420000 38 23 48.48 630 1 . ×
100 105.5923000 44 27 143.75 341 7 . ×
100 5269.125000 50 30 315.91 10,195 3 . ×
100 5697.0829able 7: Results of Stages 4 and 5 for bl ∗ = 2, d max = 4, n ∗ = 50 and dia ∗ = 25. π y ∗ n ∗ dia ∗ IP-time K ow . ×
100 423.53 Bp Hc Concluding Remarks
In this paper, we introduced a new measure, branch-height of a tree, and showed that many ofchemical compounds in the chemical database have a simple structure where the number of 2-branches is small. Based on this, we proposed a new method of applying the framework for inverseQSAR/QSPR [3, 5, 24] to the case of acyclic chemical graphs where Azam et al. [3] inferredchemical graphs with around 20 non-hydrogen atoms and Zhang et al. [24] solved an MILP ofinferring a feature vector for an instance with up to around 50 non-hydrogen atoms and diameter8. In our method, we formulated a new MILP in Stage 4 specialized for acyclic chemical graphswith a small branch number and designed a new graph search algorithm in Stage 5 that computesfrequency vectors of graphs in a dynamic programming scheme. We implemented our new methodand conducted some experiments on chemical properties such as octanol/water partition coefficient,boiling point and heat of combustion. The resulting method improved the performance so thatchemical graphs with around 50 non-hydrogen atoms and around diameter 30 can be inferred.Since there are many acyclic chemical compounds having large diameters, this is a significantimprovement.It is left as a future work to design MILPs and graph search algorithms based on the new ideaof the paper for classes of graphs with a higher rank.
Abbreviations
ANN: artificial neural network; MILP: mixed integer linear programming
Acknowledgements
This research was supported, in part, by Japan Society for the Promotionof Science, Japan, under Grant
Authors’ contributions
Conceptualization, H.N. and T.A.; methodology, H.N.; software, N.A.A.,J.Z., Y.Sun, Y.Shi, A.S. and L.Z.; validation, N.A.A., J.Z., A.S. and H.N.; formal analysis, H.N.;data resources, A.S., L.Z., H.N. and T.A.; writing–original draft preparation, H.N.; writing–reviewand editing, N.A.A., A.S. and T.A.; project administration, H.N.; funding acquisition, T.A. Allauthors have read and agreed to the published version of the manuscript.
Availability of data and materials
Source code of the implementation of our algorithm is freelyavailable from https://github.com/ku-dml/mol-infer . Competing interests
The authors declare that they have no competing interests.
Author details Department of Applied Mathematics and Physics, Kyoto University, Kyoto 606-8501, Japan. Graduate School of Advanced Integrated Studies in Human Survavibility, KyotoUniversity, Kyoto 606-8306. Bioinformatics Center, Institute for Chemical Research, KyotoUniversity, Uji 611-0011, Japan. 31 eferences [1] Akutsu T, Fukagawa D, Jansson J, Sadakane K, Inferring a graph from path frequency, Dis-crete Applied Mathematics, vol. 160, no. 10-11, pp. 1416–1428, 2012.[2] Akutsu T, Nagamochi H, A mixed integer linear programming formulation to artificial neu-ral networks, Proceedings of the 2nd International Conference on Information Science andSystems, March 2019, pp. 215–220.[3] Azam N A, R. Chiewvanichakorn, Zhang F, Shurbevski A, Nagamochi H, Akutsu T, A methodfor the inverse QSAR/QSPR based on artificial neural networks and mixed integer linearprogramming, BIOINFORMATICS2020, Malta, February 2020, pp.101–108.[4] Bohacek R S, McMartin C, Guida W C, The art and practice of structure-based drug design:A molecular modeling perspective, Medicinal Research Reviews, vol. 16, no. 1, pp. 3–50, 1996.[5] R. Chiewvanichakorn, Wang C, Zhang Z, Shurbevski A, Nagamochi H, Akutsu T, A methodfor the inverse QSAR/QSPR based on artificial neural networks and mixed integer linearprogramming, ICBBB2020, Kyoto, January 2020, paper K0013.[6] Fujiwara H, Wang J, Zhao L, Nagamochi H, Akutsu T, Enumerating treelike chemical graphswith given path frequency, Journal of Chemical Information and Modeling, vol. 48, no. 7, pp.1345–1357, 2008.[7] G´omez-Bombarelli R, Wei J N, D. Duvenaud, Hern´andez-Lobato J M, S´anchez-Lengeling B,Sheberla D, Aguilera-Iparraguirre J, Hirzel T D, Adams R P, Aspuru-Guzik A, Automaticchemical design using a data-driven continuous representation of molecules, ACS CentralScience, vol. 4, no. 2, pp. 268–276, 2018.[8] Ikebata H, Hongo K, Isomura T, Maezono R, Yoshida R, Bayesian molecular design with achemical language model, Journal of Computer-aided Molecular Design, vol. 31, no. 4, pp.379–391, 2017.[9] Ito R, N. A. Azam, Wang C, Shurbevski A, Nagamochi H, Akutsu T, A novel method for theinverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networksand integer programming, BIOCOMP2020, Las Vegas, Nevada, USA, 27-30 July 2020 (toappear).[10] Kerber A, Laue R, Gr¨uner T, Meringer M, MOLGEN 4.0, Match Communications in Math-ematical and in Computer Chemistry, no. 37, pp. 205–208, 1998.[11] Kusner M J, Paige B, Hern´andez-Lobato J M, Grammar variational autoencoder, Proceedingsof the 34th International Conference on Machine Learning-Volume 70, 2017, pp. 1945–1954.[12] J. Li, Nagamochi H, Akutsu T, Enumerating substituted benzene isomers of tree-like chemicalgraphs, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15,no. 2, pp. 633–646, 2016. 3213] Miyao M, Kaneko H, Funatsu K, Inverse QSPR/QSAR analysis for chemical structure genera-tion (from y to x), Journal of Chemical Information and Modeling, vol. 56, no. 2, pp. 286–299,2016.[14] Nagamochi H, A detachment algorithm for inferring a graph from path frequency, Algorith-mica, vol. 53, no. 2, pp. 207–224, 2009.[15] Netzeva T I, P. Worth A P, Aldenberg T, Benigni R, T. Cronin T, Gramatica P, Jaworska J S,Kahn S, Klopman G, Marchant C A, Current status of methods for defining the applicabilitydomain of (quantitative) structure-activity relationships: The report and recommendations ofECVAM workshop 52, Alternatives to Laboratory Animals, vol. 33, no. 2, pp. 155–173, 2005.[16] Reymond J L, The chemical space project, Accounts of Chemical Research, vol. 48, no. 3, pp.722–730, 2015.[17] Rupakheti C, Virshup A, Yang W, Beratan D N, Strategy to discover diverse optimal moleculesin the small molecule universe, Journal of Chemical Information and Modeling, vol. 55, no. 3,pp. 529–537, 2015.[18] Segler M H S, Kogej T, Tyrchan C, Waller M P, Generating focused molecule libraries for drugdiscovery with recurrent neural networks, ACS Central Science, vol. 4, no. 1, pp. 120–131,2017.[19] Skvortsova M I, Baskin I I, Slovokhotova O L, Palyulin V A, Zefirov N S, Inverse problem inQSAR/QSPR studies for the case of topological indices characterizing molecular shape (Kierindices), Journal of Chemical Information and Computer Sciences, vol. 33, no. 4, pp. 630–634,1993.[20] Suzuki M, Nagamochi H, Akutsu T, Efficient enumeration of monocyclic chemical graphs withgiven path frequencies, Journal of Cheminformatics, vol. 6, no. 1, p. 31, 2014.[21] Tamura Y, Nishiyama Y, Wang C, Sun Y, Shurbevski A, Nagamochi H, Akutsu T, Enumerat-ing chemical graphs with mono-block 2-augmented tree structure from given upper and lowerbounds on path frequencies, arXiv:2004.06367, 2020.[22] Yamashita K, Masui R, Zhou X, Wang C, Shurbevski A, Nagamochi H, Akutsu T, Enumer-ating chemical graphs with two disjoint cycles satisfying given path frequency specifications,arXiv:2004.08381, 2020.[23] Yang X, Zhang J, Yoshizoe K, Terayama K, Tsuda K, ChemTS: an efficient python library forde novo molecular generation, Science and Technology of Advanced Materials, vol. 18, no. 1,pp. 972–976, 2017.[24] Zhang F, Zhu J, Chiewvanichakorn C, Shurbevski A, Nagamochi H, Akutsu T, A new integerlinear programming formulation to the inverse QSAR/QSPR for acyclic chemical compoundsusing skeleton trees, The 33rd International Conference on Industrial, Engineering and OtherApplications of Applied Intelligent Systems, September 22-25, 2020 Kitakyushu, Japan (toappear). 3325] Zhu J, Wang C Shurbevski A, Nagamochi H, Akutsu T, A novel method for inference ofchemical compounds of cycle index two with desired properties based on artificial neuralnetworks and integer programming, Algorithms, vol. 13, no. 5, 124, 2020.
A Statistical Feature of Molecular Structure
We observe the following features of the graph-theoretical structure of chemical graphs registeredin the chemical database PubChem. Let DB ( ≤ n ) denote the set of chemical graphs with at most n non-hydrogen atoms that are registered in chemical database PubChem. The cycle index (or rank )of a chemical graph G = ( H = ( V, E ) , α, β ) is defined to be | E | − ( | V | −
1) (i.e., the minimumnumber of edges to be removed to make the graph H acyclic). We call a chemical graph a rank- r chemical graph if the rank of the graph is r . The core of a chemical cyclic graph G is defined tobe the induced subgraph G ′ of G such that G ′ consists of vertices in a cycle or vertices in a pathjoining two cycles. A vertex in the core (not in the core) is called a core vertex (resp., a non-corevertex). The edges not in the core of a chemical cyclic graph G form a collection of trees T , whichwe call a non-core tree . Each non-core tree contains exactly one core vertex and is regarded as atree rooted at the core vertex. The k -branch height of a chemical cyclic graph G is defined to bethe maximum of k -branch heights over all non-core trees.Let ρ r (%) denote the ratio of the number of chemical graphs with rank at most r ∈ [0 ,
4] tothe number of all chemical graphs in PubChem. See Table 8.Table 8: The percentage ρ r of the number of chemical compounds with rank at most r ∈ [0 , ρ ρ ρ ρ ρ .
9% 16 .
3% 44 .
5% 68 .
8% 84 . ρ ( d )0 (%) denote the ratio of the number of chemical graphs in DB ( ≤ such that themaximum degree is at most d ∈ [3 ,
4] to the number of all chemical graphs in DB ( ≤ . Let ρ ( d ) r (%), r ∈ [1 ,
4] denote the ratio of the number of rank- r chemical graphs in DB ( ≤ such that themaximum degree of a non-core vertex is at most d ∈ [3 ,
4] to the number of all rank- r chemicalgraphs in DB ( ≤ . See Table 9.Table 9: The percentage ρ ( d ) r of the number of chemical compounds with rank r ∈ [0 ,
4] such thatthe maximum degree of a non-core vertex is at most d ∈ [3 ,
4] over all rank- r chemical compoundsin DB ( ≤ . ρ (3)0 ρ (4)0 ρ (3)1 ρ (4)1 ρ (3)2 ρ (4)2 ρ (3)3 ρ (4)3 ρ (3)4 ρ (4)4 .
55% 99 .
85% 68 .
30% 99 .
97% 84 .
46% 99 .
99% 87 .
11% 99 .
99% 87 .
75% 99 . ρ r ( k, h ) (%), r ∈ [0 , k = 2, h ∈ [1 ,
2] denote the ratio of the number of rank- r chemicalgraphs in DB ( ≤ such that the k -branch height is at most h to the number of all rank- r chemicalgraphs in DB ( ≤ . See Table 10. We see that most chemical graphs G with at most 50 non-hydrogen atoms satisfy bh ( G ) ≤ ρ r ( k, h ) (%) of the number of rank- r chemical graphs in DB ( ≤ suchthat the k -branch height is at most h to the number of all rank- r chemical graphs in DB ( ≤ . ρ (2 , ρ (2 , ρ (2 , ρ (2 , ρ (2 , ρ (2 , ρ (2 , .
23% 99 .
46% 88 .
13% 98 .
76% 96 .
39% 99 .
17% 99 . C n H n +2 . Let Aln( n ) denote the setof all alkans with n carbon atoms, where | Aln(25) | = 36 , , ρ Aln (2 , h ) (%), h ∈ [1 , h tothe number of alkans in Aln(25). See Table 11.Table 11: The percentage ρ Aln (2 , h ) (%) of the number of alkans in Aln(25) such that the 2-branchheight is at most h to the number of alkans in Aln(25). ρ Aln (2 , ρ Aln (2 , ρ Aln (2 , ρ Aln (2 , .
03% 97 .
67% 99 .
99% 100 . ρ ( δ ) denote the ratio of the number of acyclic chemical graphs in DB ( ≤ such that thedegree of the root of the 2-branch-tree is δ ∈ [1 ,
4] to the number of all acyclic chemical graphs inDB ( ≤ . See Table 12.Table 12: The percentage ρ ( δ ) of the number of acyclic chemical graphs in DB ( ≤ such thatthe degree of the root of the 2-branch-tree is δ ∈ [1 ,
4] to the number of all acyclic chemical graphsin DB ( ≤ . ρ (1) ρ (2) ρ (3) ρ (4)6 .
39% 83 .
58% 9 .
30% 0 . T of all acyclic chemical graphs in DB ( ≤ , over 90% of them satisfy n ≤ d + 2 for the number n = | V ( T ) | of non-hydrogen atoms in a 2-fringe-tree T and the number d of non-hydrogen atoms adjacent to the root in T .Let F T , denote the set of all 2-fringe-trees that appear in an acyclic chemical graph inDB ( ≤ , and F T ( δ )0 , , δ ∈ [1 ,
3] denote the set of all 2-fringe-trees T ∈ F T , that has δ children(i.e., the degree of the root is δ ). Let ρ ( δ )2 δ +2 (%) denote the ratio of the number of 2-fringe-trees in F T ( δ )0 , that has at most 2 d + 2 vertices to the number of 2-fringe-trees in F T ( δ )0 , . See Table 13.35able 13: The percentage ρ ( δ )2 δ +2 (%) of the number of 2-fringe-trees in F T ( δ )0 , that has at most2 d + 2 vertices to the number of 2-fringe-trees in F T ( δ )0 , . ρ (1)4 ρ (2)6 ρ (3)8 .
77% 93 .
99% 92 . B All Constraints in an MILP Formulation for ChemicalAcyclic Graphs
To formulate an MILP that represents a chemical graph, we distinguish a tuple ( a , b , m ) from atuple ( b , a , m ). For a tuple γ = ( a , b , m ) ∈ Λ × Λ × { , , } , let γ denote the tuple ( b , a , m ). LetΓ < , { γ | γ ∈ Γ > } . We call a tuple γ = ( a , b , m ) ∈ Λ × Λ × { , , } proper if m ≤ min { val( a ) , val( b ) } and m ≤ max { val( a ) , val( b ) } − G must consist of two atoms of a = b . Assumethat each tuple γ ∈ Γ is proper. Let ǫ be a fictitious chemical element that represents null, call atuple ( a , b ,
0) with a , b ∈ Λ ∪ { ǫ } fictitious , and define Γ to be the set of all fictitious tuples; i.e.,Γ = { ( a , b , | a , b ∈ Λ ∪ { ǫ }} . To represent chemical elements e ∈ Λ ∪ { ǫ } ∪ Γ in an MILP, weencode these elements e into some integers denoted by [ e ]. Assume that, for each element a ∈ Λ,[ a ] is a positive integer and that [ ǫ ] = 0. B.1 Upper and Lower Bounds on Descriptors
In our formulation of an MILP for inferring a vector x ∗ in Stage 4, we fix the following descriptorsas specified constants: the number n ( G ) of vertices , the diameter dia( G ), and the number bl k ∗ ( G )of leaf k ∗ -leaf branches, which are set to be given integers n ∗ , dia ∗ and bl ∗ , respectively. For eachof the other descriptors, we specify a lower bound LB and an upper bound UB on the value sothat the descriptor takes a value from the range between LB and UB. constants: n ∗ ≥
5: the size n ( G ) of G ;LB indg ( i ) , UB indg ( i ) ∈ [0 , n ∗ ], i ∈ [1 , in i ( G )of k ∗ -internal vertices of degree i in G ;LB exdg ( i ) , UB exdg ( i ) ∈ [0 , n ∗ ], i ∈ [1 , ex i ( G )of k ∗ -internal vertices of degree i in G ;LB ince ( a ) , UB ince ( a ) ∈ [0 , n ∗ ], a ∈ Λ: lower and upper bounds on the number ce in a ( G )of k ∗ -internal vertices v with α ( v ) = a in G ;LB exce ( a ) , UB exce ( a ) ∈ [0 , n ∗ ], a ∈ Λ: lower and upper bounds on the number ce ex a ( G )of k ∗ -external vertices v with α ( v ) = a in G ;LB inbd ( m ) , UB inbd ( m ) ∈ [0 , n ∗ − m ∈ [2 , in m ( G )of k ∗ -internal edges e with β ( e ) = m in G ; 36B exbd ( m ) , UB exbd ( m ) ∈ [0 , n ∗ − m ∈ [2 , ex m ( G )of k ∗ -external edges e with β ( e ) = m in G ;LB inac ( γ ) , UB inac ( γ ) ∈ [0 , n ∗ − γ ∈ Γ < ∪ Γ = : lower and upper bounds on the number ac in γ ( G )of k ∗ -internal edges e with adjacency-configuration γ in G ;LB exac ( γ ) , UB exac ( γ ) ∈ [0 , n ∗ − γ ∈ Γ < ∪ Γ = : lower and upper bounds on the number ac ex γ ( G )of k ∗ -external edges e with adjacency-configuration γ in G ;LB tbc ( µ ) , UB tbc ( µ ) ∈ [0 , n ∗ − µ ∈ Bc: lower and upper bounds on the number bc in µ ( G )of k ∗ -internal edges e with bond-configuration µ in G ;LB exbc ( µ ) , UB exbc ( µ ) ∈ [0 , n ∗ − µ ∈ Bc: lower and upper bounds on the number bc ex µ ( G )of k ∗ -internal edges e with bond-configuration µ in G ; variables x for descriptors: dg in ( i ) , dg ex ( i ) ∈ [0 , n ∗ ], i ∈ [1 , in ( i ) (resp., dg ex ( i )) represents dg in i ( G ) (resp., dg ex i ( G ));ce in ( a ) , ce ex ( a ) ∈ [0 , n ∗ ], a ∈ Λ: ce in ( a ) (resp., ce ex ( a )) represents ce in a ( G ) (resp., ce ex a ( G ));bd in ( m ) , bd ex ( m ) ∈ [0 , n ∗ ], m ∈ [1 , in ( m ) (resp., bd ex ( m ))represents bd in m ( G ) (resp., bd ex m ( G ));ac in ( γ ) , ac ex ( γ ) ∈ [0 , n ∗ ], γ ∈ Γ < ∪ Γ = : ac in ( γ ) (resp., ac ex ( γ )) represents represents ac in γ ( G )(resp., ac ex γ ( G ));bc in ( µ ) , bc ex ( µ ) ∈ [0 , n ∗ − µ ∈ Bc: bc in ( µ ) (resp., bc ex ( µ )) represents represents bc in µ ( G )(resp., bc ex µ ( G )); constraints: LB tdg ( i ) ≤ dg t ( i ) ≤ UB tdg ( i ) , i ∈ [1 , , t ∈ { in , ex } , (3)LB tce ( a ) ≤ ce t ( a ) ≤ UB tce ( a ) , a ∈ Λ , t ∈ { in , ex } , (4)LB tbd ( m ) ≤ bd t ( m ) ≤ UB tbd ( m ) , m ∈ [2 , , t ∈ { in , ex } , (5)LB tac ( γ ) ≤ ac t ( γ ) ≤ UB tac ( γ ) , γ ∈ Γ , t ∈ { in , ex } , (6)LB tbc ( µ ) ≤ bc t ( µ ) ≤ UB tbc ( µ ) , µ ∈ Bc , t ∈ { in , ex } . (7)We use the range-based method to define an applicability domain for our method. For this,we find the range (the minimum and maximum) of each descriptor over all relevant chemicalcompounds and represent each range as a set of linear constraints in the constraint set C of ourMILP formulation. Recall that D π stands for a set of chemical graphs used for constructing aprediction function. However, the number of examples in D π may not be large enough to capturea general feature on the structure of chemical graphs. For this, we also use some data set fromthe whole set DB of chemical graphs in a data base. Let DB ( i ) G denote the set of chemical graphs G ∈ DB ∩ G such that n ( G ) = i for each integer i ≥
1. Based on this, we assume that the givenlower and upper bounds on the above descriptors satisfy the following. For each t ∈ { in , ex } , n ∗ min G ∈ D π ∪ DB ( n ∗ ) G dg t i ( G ) n ( G ) ≤ LB tdg ( i ) ≤ UB tdg ( i ) ≤ n ∗ max G ∈ D π ∪ DB ( n ∗ ) G dg t i ( G ) n ( G ) , i ∈ [1 , , (8) n ∗ min G ∈ D π ∪ DB ( n ∗ ) G ce t a ( G ) n ( G ) ≤ LB tce ( a ) ≤ UB tce ( a ) ≤ n ∗ max G ∈ D π ∪ DB ( n ∗ ) G ce t a ( G ) n ( G ) , a ∈ Λ , (9)37 n ∗ −
1) min G ∈ D π ∪ DB ( n ∗ ) G bd t m ( G ) n ( G ) − ≤ LB tbd ( m ) ≤ UB tbd ( m ) ≤ ( n ∗ −
1) max G ∈ D π ∪ DB ( n ∗ ) G bd t m ( G ) n ( G ) − , m ∈ [2 , , (10)( n ∗ −
1) min G ∈ D π ∪ DB ( n ∗ ) G ac t γ ( G ) n ( G ) − ≤ LB tac ( γ ) ≤ UB tac ( γ ) ≤ ( n ∗ −
1) max G ∈ D π ∪ DB ( n ∗ ) G ac t γ ( G ) n ( G ) − , γ ∈ Γ , (11)( n ∗ −
1) min G ∈ D π ∪ DB ( n ∗ ) G bc t µ ( G ) n ( G ) − ≤ LB tbc ( µ ) ≤ UB tbc ( µ ) ≤ ( n ∗ −
1) max G ∈ D π ∪ DB ( n ∗ ) G bc t µ ( G ) n ( G ) − , µ ∈ Bc . (12) B.2 Construction of Scheme Graph
We infer a subgraph H such that the maximum degree is d max ∈ { , } , n ( H ) = n ∗ , bh k ∗ ( H ) = bh ∗ and bl k ∗ ( H ) = bl ∗ . For this, we first construct the scheme graph SG( d max , k ∗ , bh ∗ , t ∗ ). We thenprepare a binary variable u ( s, i ) (resp., v ( t, i )) for each vertex u s,i in tree S s (resp., v t,i in tree T t ).Recall that when the two end-vertices of edge a i = ( u s, , u s ′ , ) ∈ E B = { a , a . . . , a c ∗ } is connected in a selected subgraph H , either edge a i is directly used in H or a path P i =( u s, , v t ′ , , v t ′ +1 , , . . . , v t ′′ , , u s ′ , ) from u s, to u s ′ , visiting some vertices in P t ∗ is constructed in H . We regard the index i of each edge a i ∈ E B = { a , a . . . , a c ∗ } as the “color” of the edge, anddefine the color set of E B to be [1 , c ∗ ]. To introduce necessary linear constraints that can constructsuch a path P i properly in our MILP, we assign the color i to the vertices v t ′ , , v t ′ +1 , , . . . , v t ′′ , in P t ∗ when a path P i = ( u s, , v t ′ , , v t ′ +1 , , . . . , v t ′′ , , u s ′ , ) is used in H . constants: Integers d max ∈ { , } , n ∗ ≥
3, dia ∗ ≥ k ∗ ≥
1, bh ∗ ≥ ∗ ≥ variables: a ( i ) ∈ { , } , i ∈ E B : a ( i ) represents edge a i ∈ E B ( a ( i ) = 1, i ∈ E B )( a ( i ) = 1 ⇔ edge a i is used in H ); e ( s, t ) , e ( t, s ) ∈ { , } , s ∈ [1 , s ∗ ], t ∈ [1 , t ∗ ]: e ( s, t ) (resp., e ( t, s )) representsdirection ( u s, , v t, ) (resp., ( v t, , u s, )), where e ( s, t ) = 1 (resp., e ( t, s ) = 1) ⇔ edge u s, , v t, is used in H and direction ( u s, , v t, ) (resp., ( v t, , u s, )) is assignedto edge u s, v t, ; χ ( t ) ∈ [0 , c ∗ ], t ∈ [1 , t ∗ ]: χ ( t ) represents the color c ∈ [0 , c ∗ ] assigned to vertex v t, ( χ ( t ) = c ⇔ vertex v t, is assigned color c , where χ ( t ) = c = 0 iff v t, is not in H ); δ clr ( t, c ) ∈ { , } , t ∈ [1 , t ∗ ], c ∈ [0 , c ∗ ] ( δ clr ( t, c ) = 1 ⇔ χ ( t ) = c );clr( c ) ∈ [0 , t ∗ ], c ∈ [0 , c ∗ ]: the number of vertices v t,i with color c ;deg b+ ( s ) ∈ [0 , s ∈ [1 , s ∗ ]: the out-degree of vertex u s, in the k ∗ -branch-subtree of H ;deg b − ( s ) ∈ [0 , s ∈ [1 , s ∗ ]: the in-degree of vertex u s, in the k ∗ -branch-subtree of H ;38 onstraints: X c ∈ [0 ,c ∗ ] δ clr ( t, c ) = 1 , X c ∈ [0 ,c ∗ ] c · δ clr ( t, c ) = χ ( t ) , t ∈ [1 , t ∗ ] , (13) X t ∈ [1 ,t ∗ ] δ clr ( t, c ) = clr( c ) , c ∈ [0 , c ∗ ] , (14) t ∗ (1 − a ( i )) ≥ clr( i ) , i ∈ [1 , c ∗ ] , (15) e ( s, t ) + e ( t, s ) ≤ , s ∈ [1 , s ∗ ] , t ∈ [1 , t ∗ ] , (16) X s ∈ [1 ,s ∗ ] \{ head( c ) } e ( t, s ) ≤ − δ clr ( t, c ) , X s ∈ [1 ,s ∗ ] \{ tail( c ) } e ( s, t ) ≤ − δ clr ( t, c ) , c ∈ [1 , c ∗ ] , t ∈ [1 , t ∗ ] , (17) X i ∈ E − B ( s ) a ( i ) + X t ∈ [1 ,t ∗ ] e ( t, s ) = deg b − ( s ) , X i ∈ E + B ( s ) a ( i ) + X t ∈ [1 ,t ∗ ] e ( s, t ) = deg b+ ( s ) , deg b − ( s ) + deg b+ ( s ) ≤ d max , s ∈ [1 , s ∗ ] . (18) B.3 Selecting a Subgraph
From the scheme graph SG( d max , k ∗ , bh ∗ , t ∗ ), we select a subgraph H such that n ( H ) = n ∗ ,dia( H ) = dia ∗ , bh k ∗ ( H ) = bh ∗ and bl k ∗ ( H ) = bl ∗ . constants: Integers d max ∈ { , } , n ∗ ≥
3, dia ∗ ≥ k ∗ ≥
1, bh ∗ ≥ ∗ ≥ S s = T ( d max − , d max − , k ∗ ),the set Cld S ( i ) of the indices of children of a vertex v i ;the index prt( i ) of the parent of a non-root vertex v i ;the set Dsn S ( d ) of indices i of a vertex v i whose depth is d ;a proper set P prc ( d max − , d max − , k ∗ ) of index pairs,where we denote P prc ( d max − , d max − , k ∗ ) by P S, prc ;For each tree T t = T ( d max − , d max − , k ∗ ),the set Cld T ( i ) of the indices of children of a vertex v i ;the index prt( i ) of the parent of a non-root vertex v i ;a proper set P prc ( d max − , d max − , k ∗ ) of index pairs,where we denote P prc ( d max − , d max − , k ∗ ) by P T, prc ; variables: σ ( s ) ∈ { , } , s ∈ [1 , s ∗ ]: ( σ ( s ) = 1 ⇔ vertex u s, is a non-leaf k ∗ -branch or a root); u ( s, i ) ∈ { , } , s ∈ [1 , s ∗ ], i ∈ [1 , n Stree ]: u ( s, i ) represents vertex u s,i ( u ( s, i ) = 1 ⇔ vertex u s,i is used in H and edge e ′ s,i ( i ≥
2) is used in H ),39 u ( s,
1) = 1 and σ ( s ) = 0 ⇔ vertex u s, is a leaf k ∗ -branch); v ( t, i ) ∈ { , } , t ∈ [1 , t ∗ ], i ∈ [1 , n Ttree ]: v ( t, i ) represents vertex v t,i ( v ( t, i ) = 1 ⇔ vertex v t,i is used in H and edge e t,i ( i ≥
2) is used in H ); e ( t ) ∈ { , } , t ∈ [1 , t ∗ + 1]: e ( t ) represents edge e t, = v t − , v t, ,where e , and e t ∗ +1 , are fictitious edges ( e ( t ) = 1 ⇔ edge e t, is used in H ); constraints: u ( s, i ) ≥ u ( s, j ) , s ∈ [1 , s ∗ ] , ( i, j ) ∈ P S, prc , (19) v ( t, i ) ≥ v ( t, j ) , t ∈ [1 , t ∗ ] , ( i, j ) ∈ P T, prc , (20) X s ∈ [1 ,s ∗ ] ,i ∈ [1 ,n Stree ] u ( s, i ) + X t ∈ [1 ,t ∗ ] ,i ∈ [1 ,n Ttree ] v ( t, i ) = n ∗ , (21) X i ∈ [1 ,n Stree ] u ( s, i ) ≤ X j ∈ Cld S (1) u ( s, j ) , s ∈ [1 , s ∗ ] , (22) X i ∈ [1 ,n Ttree ] v ( t, i ) ≤ X j ∈ Cld T (1) v ( t, j ) , t ∈ [1 , t ∗ ] , (23) e ( t + 1) + X s ∈ [1 ,s ∗ ] e ( t, s ) = v ( t, , e ( t ) + X s ∈ [1 ,s ∗ ] e ( s, t ) = v ( t, , (where e (1) = e ( t ∗ + 1) = 0) , t ∈ [1 , t ∗ ] , (24) X c ∈ [1 ,c ∗ ] δ clr ( t, c ) = v ( t, , t ∈ [1 , t ∗ ] , (25) c ∗ · (1 − e ( t + 1)) ≥ χ ( t ) − χ ( t + 1) ≥ v ( t, − e ( t + 1) , t ∈ [1 , t ∗ − , (26) a ( i ) + X t ∈ [1 ,t ∗ ] e ( t, i + 1) = u ( i + 1 , , i ∈ [1 , c ∗ ] , (27) σ ( s ) = u ( s,
1) = 1 , if u s is the root , (28) σ ( s ) ≤ u ( s, , s ∈ [1 , s ∗ ] , (29)( d max − σ ( s ) ≥ X s ′ ∈ Cld B ( s ) u ( s ′ , ≥ σ ( s ) , X i ∈ Dsn S ( k ∗ ) u ( s, i ) ≥ u ( s, − σ ( s ) ,s ∈ [1 , s ∗ ] , u s = root , (30)40 s ∈ [2 ,s ∗ ] ( u ( s, − σ ( s )) = bl ∗ , X s ∈ V B (bh ∗ ) u ( s, ≥ , (31) X s ∈ V B,s left u ( s,
1) + X i ∈ E B,s left clr( i ) = l dia ∗ m − k ∗ , X s ∈ V B,s right u ( s,
1) + X i ∈ E B,s right clr( i ) = j dia ∗ k − k ∗ , (32) X i ∈ V B,s u ( i,
1) + X i ∈ E B,s clr( i ) ≤ j dia ∗ k − k ∗ , s ∈ L B \ { s left , s right } . (33)Constraints (22) and (23) represent an extension of the constraint (1) on the size of 2-fringe-treeto the case of the general branch-parameter k ∗ . B.4 Assigning Multiplicity
We prepare an integer variable e β ( e ) or b β ( e ) for each edge e in the scheme graph SG( d max , k ∗ , bh ∗ , t ∗ )to denote the multiplicity of e in a selected graph H and include necessary constraints for thevariables to satisfy in H . constants: Prepare functions tail and head such that a i = ( u tail( i ) , u head( i ) ) ∈ E B ;Assume that each edge in a tree S s , s ∈ [1 , s ∗ ] (resp., T t , t ∈ [1 , t ∗ ]) is denoted by e ′ s,i (resp., e t,i ) with the integer i ∈ [2 , n Stree ] of the head u s,i (resp., v t,i ) of the edge. variables: e β ( i ) ∈ [0 , i ∈ [1 , c ∗ ]: e β ( i ) represents the multiplicity of edge a i ,where e β ( i ) = 0 if edge a i is not in an inferred chemical graph G ; e β ( p, i ) ∈ [0 , p ∈ [1 , s ∗ + t ∗ ], i ∈ [2 , n Stree ]: e β ( p, i ) with p ≤ s ∗ (resp., p > s ∗ ) representsthe multiplicity of edge e ′ p,i (resp., e p − s ∗ ,i ); e β ( t, ∈ [0 , t ∈ [1 , t ∗ + 1]: e β ( t,
1) represents the multiplicity of edge e t, ; b β ( s, t ) ∈ [0 , s ∈ [1 , s ∗ ], t ∈ [1 , t ∗ ]: b β ( s, t ) represents the multiplicity of edge u s, v t, ; constraints: a ( i ) ≤ e β ( i ) ≤ a ( i ) , i ∈ [1 , c ∗ ] , (34) u ( s, i ) ≤ e β ( s, i ) ≤ u ( s, i ) , s ∈ [1 , s ∗ ] , i ∈ [2 , n Stree ] , (35) v ( t, i ) ≤ e β ( s ∗ + t, i ) ≤ v ( t, i ) , t ∈ [1 , t ∗ ] , i ∈ [2 , n Ttree ] , (36) e ( t ) ≤ e β ( t, ≤ e ( t ) , t ∈ [1 , t ∗ + 1] , (37) e ( s, t ) + e ( t, s ) ≤ b β ( s, t ) ≤ e ( s, t ) + 3 e ( t, s ) , s ∈ [1 , s ∗ ] , t ∈ [1 , t ∗ ] . (38)41 .5 Assigning Chemical Elements and Valence Condition We include constraints so that each vertex v in a selected graph H satisfies the valence condition;i.e., P uv ∈ E ( H ) β ( uv ) ≤ val( α ( u )). With these constraints, a chemical acyclic graph G = ( H, α, β )on a selected subgraph H will be constructed. constants: A set Λ ∪ { ǫ } of chemical elements, where ǫ denotes null;A coding [ a ], a ∈ Λ ∪ { ǫ } such that [ ǫ ] = 0; [ a ] ≥ a ∈ Λ; and [ a ] = [ b ] if a = b ;Let [Λ] and [Λ ∪ { ǫ } ] denote { [ a ] | a ∈ Λ } and { [ a ] | a ∈ Λ ∪ { ǫ }} , respectively;A valence function: val : Λ → [1 , E B ( s ) denote the set of indices i of all edges a i ∈ E B adjacent to vertex u s, in T B . variables: e α ( p, i ) ∈ [Λ ∪ { ǫ } ], p ∈ [1 , s ∗ + t ∗ ], i ∈ [1 , n Stree ]: e α ( p, i ) with p ≤ s ∗ (resp., p > s ∗ ) represents α ( u p,i ) (resp., α ( v p − s ∗ ,i )); δ α ( p, i, a ) ∈ { , } , p ∈ [1 , s ∗ + t ∗ ], i ∈ [1 , n Stree ], a ∈ Λ ∪ { ǫ } : δ α ( p, i, a ) = 1 ⇔ α ( u p,i ) = a for p ≤ s ∗ and α ( v p − s ∗ ,i ) = a for p > s ∗ ; δ e β ( i, m ) ∈ { , } , p ∈ [1 , s ∗ + t ∗ ], i ∈ [1 , c ∗ ], m ∈ [0 , δ e β ( i, m ) = 1 ⇔ the multiplicity of edge a i in an inferred chemical graph G is m ; δ e β ( p, i, m ) ∈ { , } , p ∈ [1 , s ∗ + t ∗ ], i ∈ [2 , n Stree ], m ∈ [0 , δ e β ( p, i, m ) = 1 ⇔ the multiplicity of edge e ′ p,i , p ≤ s ∗ (or e p − s ∗ ,i , p > s ∗ ) in G is m ; δ e β ( t, , m ) ∈ { , } , t ∈ [1 , t ∗ + 1], m ∈ [0 , δ e β ( t, , m ) = 1 ⇔ the multiplicity of edge e t in G is q ; δ b β ( s, t, m ) ∈ { , } , s ∈ [1 , s ∗ ], t ∈ [1 , t ∗ ], m ∈ [0 , δ b β ( s, t, m ) = 1 ⇔ the multiplicity of edge u s, v t, in G is m ; constraints: X a ∈ Λ ∪{ ǫ } δ α ( p, i, a ) = 1 , p ∈ [1 , s ∗ + t ∗ ] , i ∈ [1 , n Stree ] , (39) X a ∈ Λ ∪{ ǫ } [ a ] · δ α ( p, i, a ) = e α ( p, i ) , p ∈ [1 , s ∗ + t ∗ ] , i ∈ [1 , n Stree ] , (40) X m ∈ [0 , δ e β ( i, q ) = 1 , X m ∈ [1 , m · δ e β ( i, m ) = e β ( i ) , i ∈ [1 , c ∗ ] , (41) X m ∈ [0 , δ e β ( p, i, m ) = 1 , X m ∈ [1 , m · δ e β ( p, i, m ) = e β ( p, i ) , p ∈ [1 , s ∗ + t ∗ ] , i ∈ [2 , n Stree ] , (42) X m ∈ [0 , δ e β ( t, , q ) = 1 , X m ∈ [1 , m · δ e β ( t, , m ) = e β ( t, , t ∈ [1 , t ∗ + 1] , (43)42 m ∈ [0 , δ b β ( s, t, m ) = 1 , X m ∈ [0 , mδ b β ( s, t, m ) = b β ( s, t ) , s ∈ [1 , s ∗ ] , t ∈ [1 , t ∗ ] , (44) X i ∈ E B ( s ) e β ( i ) + X t ∈ [1 ,t ∗ ] b β ( s, t ) + X j ∈ Cld S (1) e β ( s, j ) ≤ X a ∈ Λ val( a ) · δ α ( s, , a ) , s ∈ [1 , s ∗ ] , (45) X s ∈ [1 ,s ∗ ] b β ( s, t ) + e β ( t,
1) + e β ( t +1 ,
1) + X j ∈ Cld T (1) e β ( s ∗ + t, j ) ≤ X a ∈ Λ val( a ) δ α ( s ∗ + t, , a ) , t ∈ [1 , t ∗ ] , (46) e β ( s, i ) + X j ∈ Cld S ( i ) e β ( s, j ) ≤ X a ∈ Λ val( a ) δ α ( s, i, a ) , s ∈ [1 , s ∗ ] , i ∈ [2 , n Stree ] , (47) e β ( s ∗ + t, i ) + X j ∈ Cld T ( i ) e β ( s ∗ + t, j ) ≤ X a ∈ Λ val( a ) δ α ( s ∗ + t, i, a ) , t ∈ [1 , t ∗ ] , i ∈ [2 , n Ttree ] . (48) B.6 Descriptors on Mass, the Numbers of Elements and Bonds
We include constraints to compute descriptors ms( G ), ce a ( G ) ( a ∈ Λ), bd m ( G ) ( m ∈ [2 , n H ( G ) according to the definitions in Section 2.2. constants: A function mass ∗ : Λ → Z (we let mass( a ) denote the observed mass of a chemical element a ∈ Λ, and define mass ∗ ( a ) = ⌊ · mass( a ) ⌋ ); variables: Mass ∈ Z : Mass represents P v ∈ V mass ∗ ( α ( v ));bd( m ) ∈ [0 , n ∗ ], m ∈ [1 , H ∈ [0 , n ∗ ]: the number n H ( G ) of hydrogen atoms to be included to G ; constraints: X p ∈ [1 ,s ∗ + t ∗ ] δ α ( p, , a ) = ce in ( a ) , X p ∈ [1 ,s ∗ + t ∗ ] ,i ∈ [2 ,n Stree ] δ α ( p, i, a ) = ce ex ( a ) , a ∈ Λ , (49) X a ∈ Λ mass ∗ ( a )(ce in ( a ) + ce ex ( a )) = Mass , (50) X i ∈ [1 ,c ∗ ] δ e β ( i, q ) + X s ∈ [1 ,s ∗ ] ,t ∈ [1 ,t ∗ ] δ b β ( s, t, q ) + X t ∈ [2 ,t ∗ ] δ e β ( t, , q ) = bd in ( m ) , m ∈ [1 , , (51) X p ∈ [1 ,s ∗ + t ∗ ] ,i ∈ [2 ,n Stree ] δ e β ( p, i, m ) = bd ex ( m ) , m ∈ [1 , , (52) X a ∈ Λ val( a )(ce in ( a ) + ce ex ( a )) − n ∗ − in (2) + bd ex (2) + 2bd in (3) + 2bd ex (3)) = n H . (53)43 .7 Descriptor for the Number of Specified Degree We include constraints to compute descriptors dg i ( G ) ( i ∈ [1 , H is at most 3(resp., equal to 4) when d max = 3 (resp., d max = 4). variables: deg( p, i ) ∈ [0 , p ∈ [1 , s ∗ + t ∗ ], i ∈ [1 , n Stree ]:deg( p, i ) represents deg H ( u p,i ) for p ≤ s ∗ or deg H ( v p − s ∗ ,i ) for p > s ∗ ; δ deg ( p, i, d ) ∈ { , } , p ∈ [1 , s ∗ + t ∗ ], i ∈ [1 , n Stree ], d ∈ [0 , δ deg ( p, i, d ) = 1 ⇔ deg( p, i ) = d ; constraints: X i ∈ E B ( s ) a ( i ) + X t ∈ [1 ,t ∗ ] ( e ( s, t ) + e ( t, s )) + X j ∈ Cld S (1) u ( s, j ) = deg( s, , s ∈ [1 , s ∗ ] , (54) u ( s, i ) + X j ∈ Cld S ( i ) u ( s, j ) = deg( s, i ) , s ∈ [1 , s ∗ ] , i ∈ [2 , n Stree ] , (55)2 v ( t,
1) + X j ∈ Cld T (1) v ( t, j ) = deg( s ∗ + t, , t ∈ [1 , t ∗ ] , (56) v ( t, i ) + X j ∈ Cld T ( i ) v ( t, j ) = deg( s ∗ + t, i ) , t ∈ [1 , t ∗ ] , i ∈ [2 , n Ttree ] , (57) X d ∈ [0 , δ deg ( p, i, d ) = 1 , X d ∈ [1 , d · δ deg ( p, i, d ) = deg( p, i ) , p ∈ [1 , s ∗ + t ∗ ] , i ∈ [1 , n Stree ] , (58) X p ∈ [1 ,s ∗ + t ∗ ] δ deg ( p, , d ) = dg in ( d ) , X p ∈ [1 ,s ∗ + t ∗ ] ,i ∈ [2 ,n Stree ] δ deg ( p, i, d ) = dg ex ( d ) , d ∈ [1 , , (59)dg in (4) + dg ex (4) ≥ d max = 4 (resp., = 3). (60) B.8 Descriptor for the Number of Adjacency-configurations
We include constraints to compute descriptors ac γ ( G ) ( γ = ( a , b , m ) ∈ Γ) according to the defini-tions in Section 2.2. constants:
A set Γ = Γ < ∪ Γ = ∪ Γ > of proper tuples ( a , b , m ) ∈ Λ × Λ × [1 , = { ( a , b , | a , b ∈ Λ ∪ { ǫ }} ; variables: δ τ ( i, γ ) ∈ { , } , i ∈ [1 , c ∗ ], γ ∈ Γ ∪ Γ : 44 τ ( i, γ ) = 1 ⇔ edge a i is assigned tuple γ ; i.e., γ = ( e α (tail( i ) , , e α (head( i ) , , e β ( i )); δ τ ( t, , γ ) ∈ { , } , t ∈ [2 , t ∗ ], γ ∈ Γ ∪ Γ : δ τ ( t, , γ ) = 1 ⇔ edge e t, is assigned tuple γ ; i.e., γ = ( e α ( s ∗ + t − , , e α ( s ∗ + t, , e β ( t, δ τ ( p, i, γ ) ∈ { , } , p ∈ [1 , s ∗ + t ∗ ], i ∈ [2 , n Stree ], γ ∈ Γ ∪ Γ : δ τ ( p, i, γ ) = 1 ⇔ edge e ′ p,i , p ≤ s ∗ (or e p − s ∗ ,i , p > s ∗ ) is assigned tuple γ ; i.e., γ = ( e α ( p, prt( i )) , e α ( p, i ) , e β ( p, i )); δ b τ ( s, t, γ ) ∈ { , } , s ∈ [1 , s ∗ ], t ∈ [1 , t ∗ ], γ ∈ Γ ∪ Γ : δ b τ ( s, t, γ ) = 1 ⇔ edge u s, v t, is assigned tuple γ ; i.e., γ = ( e α ( s, , e α ( s ∗ + t, , b β ( s, t )); constraints: X γ ∈ Γ ∪ Γ δ τ ( i, γ ) = 1 , X ( a , b ,m ) ∈ Γ ∪ Γ [ a ] δ τ ( i, ( a , b , m )) = e α (tail( i ) , , X ( a , b ,m ) ∈ Γ ∪ Γ [ b ] δ τ ( i, ( a , b , m )) = e α (head( i ) , , X ( a , b ,m ) ∈ Γ ∪ Γ m · δ τ ( i, ( a , b , m )) = e β ( i ) , i ∈ [1 , c ∗ ] , (61) X γ ∈ Γ ∪ Γ δ τ ( t, , γ ) = 1 , X ( a , b ,m ) ∈ Γ ∪ Γ [ a ] δ τ ( t, , ( a , b , m )) = e α ( s ∗ + t − , , X ( a , b ,m ) ∈ Γ ∪ Γ [ b ] δ τ ( t, , ( a , b , m )) = e α ( s ∗ + t, , X ( a , b ,m ) ∈ Γ ∪ Γ m · δ τ ( t, , ( a , b , m )) = e β ( t, , t ∈ [2 , t ∗ ] , (62) X γ ∈ Γ ∪ Γ δ τ ( p, i, γ ) = 1 , X ( a , b ,m ) ∈ Γ ∪ Γ [ a ] δ τ ( p, i, ( a , b , m )) = e α ( p, prt( i )) , X ( a , b ,m ) ∈ Γ ∪ Γ [ b ] δ τ ( p, i, ( a , b , m )) = e α ( p, i ) , X ( a , b ,m ) ∈ Γ ∪ Γ m · δ τ ( p, i, ( a , b , m )) = e β ( p, i ) ,p ∈ [1 , s ∗ + t ∗ ] , i ∈ [2 , n Stree ] , (63) X γ ∈ Γ ∪ Γ δ b τ ( s, t, γ ) = 1 , X ( a , b ,m ) ∈ Γ ∪ Γ [ a ] δ b τ ( s, t, ( a , b , m )) = e α ( s, , X ( a , b ,m ) ∈ Γ ∪ Γ [ b ] δ b τ ( s, t, ( a , b , m )) = e α ( s ∗ + t, , X ( a , b ,m ) ∈ Γ ∪ Γ m · δ b τ ( s, t, ( a , b , m )) = b β ( s, t ) ,s ∈ [1 , s ∗ ] , t ∈ [1 , t ∗ ] , (64) X i ∈ [1 ,c ∗ ] ( δ τ ( i, γ ) + δ τ ( i, γ )) + X s ∈ [1 ,s ∗ ] ,t ∈ [1 ,t ∗ ] ( δ b τ ( s, t, γ ) + δ b τ ( s, t, γ ))+ X t ∈ [2 ,t ∗ ] ( δ τ ( t, , γ ) + δ τ ( t, , γ )) = ac in ( γ ) , γ ∈ Γ < , (65)45 i ∈ [1 ,c ∗ ] δ τ ( i, γ ) + X s ∈ [1 ,s ∗ ] ,t ∈ [1 ,t ∗ ] δ b τ ( s, t, γ ) + X t ∈ [2 ,t ∗ ] δ τ ( t, , γ ) = ac in ( γ ) , γ ∈ Γ = , (66) X p ∈ [1 ,s ∗ + t ∗ ] ,i ∈ [2 ,n Stree ] ( δ τ ( p, i, γ ) + δ τ ( p, i, γ )) = ac ex ( γ ) , γ ∈ Γ < , (67) X p ∈ [1 ,s ∗ + t ∗ ] ,i ∈ [2 ,n Stree ] δ τ ( p, i, γ ) = ac ex ( γ ) , γ ∈ Γ = . (68) B.9 Descriptor for Bond-configuration
We include constraints to compute descriptor for bond-configuration bd µ ( G ), µ ∈ Bc according tothe definition. variables: bc( µ ) ∈ [0 , n ∗ − µ ∈ Bc; δ dc ( i, d, d ′ , m ) ∈ { , } , i ∈ [1 , c ∗ ], d, d ′ ∈ [0 , m ∈ [0 , δ dc ( i, d, d ′ , m ) = 1 ⇔ deg H ( u tail( i ) ) = d , deg H ( u head( i ) ) = d ′ and β ( a i ) = m ∈ [1 ,
3] in G ; δ dc ( t, , d, d ′ , m ) ∈ { , } , t ∈ [2 , t ∗ ], d, d ′ ∈ [0 , m ∈ [0 , δ dc ( t, , d, d ′ , m ) = 1 ⇔ deg H ( v t − , ) = d , deg H ( v t, ) = d ′ and β ( e t, ) = m ∈ [1 ,
3] in G ; δ dc ( p, i, d, d ′ , m ) ∈ { , } , p ∈ [1 , s ∗ + t ∗ ], i ∈ [2 , n Stree ], d, d ′ ∈ [0 , m ∈ [0 , δ dc ( p, i, d, d ′ , m ) = 1 ⇔ deg H ( u p, prt( i ) ) = d deg H ( u p,i ) = d ′ and β ( e ′ p,i ) = m ∈ [1 ,
3] for p ≤ s ∗ (or deg H ( v p − s ∗ , prt( i ) ) = d , deg H ( v p − s ∗ ,i ) = d ′ and β ( e p − s ∗ ,i ) = m ∈ [1 ,
3] for p > s ∗ ) in G ; δ c dc ( s, t, d, d ′ , m ) ∈ { , } , s ∈ [1 , s ∗ ], t ∈ [1 , t ∗ ], d, d ′ ∈ [0 , m ∈ [0 , δ c dc ( s, t, d, d ′ ,
1) = 1 ⇔ deg H ( u s, ) = d , deg H ( v t, ) = d ′ and β ( u s, v t, ) = m ∈ [1 ,
3] in G ; constraints: X d,d ′ ∈ [0 , ,m ∈ [0 , δ dc ( i, d, d ′ , m ) = 1 , X d,d ′ ∈ [0 , ,m ∈ [0 , m · δ dc ( i, d, d ′ , m ) = e β ( i ) , X d ∈ [1 , ,d ′ ∈ [0 , ,m ∈ [0 , d · δ dc ( i, d, d ′ , m ) = deg(tail( i ) , , X d ∈ [0 , ,d ′ ∈ [1 , ,m ∈ [0 , d ′ · δ dc ( i, d, d ′ , m ) = deg(head( i ) , , i ∈ [1 , c ∗ ] , (69) X d,d ′ ∈ [0 , ,m ∈ [0 , δ dc ( t, , d, d ′ , m ) = 1 , X d,d ′ ∈ [0 , ,m ∈ [0 , m · δ dc ( t, , d, d ′ , m ) = e β ( t, , X d ∈ [1 , ,d ′ ∈ [0 , ,m ∈ [0 , d · δ dc ( t, , d, d ′ , m ) = deg( s ∗ + t − , , X d ∈ [0 , ,d ′ ∈ [1 , ,m ∈ [0 , d ′ · δ dc ( t, , d, d ′ , m ) = deg( s ∗ + t, , t ∈ [2 , t ∗ ] , (70)46 d,d ′ ∈ [0 , ,m ∈ [0 , δ dc ( p, i, d, d ′ , m ) = 1 , p ∈ [1 , s ∗ + t ∗ ] , i ∈ [2 , n Stree ] , (71) X d,d ′ ∈ [0 , ,m ∈ [0 , m · δ dc ( s, i, d, d ′ , m ) = e β ( s, i ) , s ∈ [1 , s ∗ ] , i ∈ [2 , n Stree ] , (72) X d,d ′ ∈ [0 , ,m ∈ [0 , m · δ dc ( s ∗ + t, i, d, d ′ , m ) = e β ( s ∗ + t, i ) , t ∈ [1 , t ∗ ] , i ∈ [2 , n Ttree ] , (73) X d ∈ [1 , ,d ′ ∈ [0 , ,m ∈ [0 , d · δ dc ( p, i, d, d ′ , m ) = deg( p, prt( i )) , X d ∈ [0 , ,d ′ ∈ [1 , ,m ∈ [0 , d ′ · δ dc ( t, i, d, d ′ , m ) = deg( p, i ) , p ∈ [1 , s ∗ + t ∗ ] , i ∈ [2 , n Stree ] , (74) X d,d ′ ∈ [1 , ,m ∈ [0 , δ c dc ( s, t, d, d ′ , m ) = 1 , X d,d ′ ∈ [1 , ,m ∈ [0 , m · δ c dc ( s, t, d, d ′ , m ) = b β ( s, t ) , X d ∈ [1 , ,d ′ ∈ [0 , ,m ∈ [0 , d · δ c dc ( s, t, d, d ′ , m ) = deg( s, , X d ∈ [0 , ,d ′ ∈ [1 , ,m ∈ [0 , d ′ · δ c dc ( s, t, d, d ′ , m ) = deg( s ∗ + t, , s ∈ [1 , s ∗ ] , t ∈ [1 , t ∗ ] , (75) X i ∈ [1 ,c ∗ ] ( δ dc ( i, d, d ′ , m )+ δ dc ( i, d ′ , d, m )) + X t ∈ [2 ,t ∗ ] ( δ dc ( t, , d, d ′ , m )+ δ dc ( t, , d ′ , d, m ))+ X s ∈ [1 ,s ∗ ] ,t ∈ [1 ,t ∗ ] ( δ c dc ( s, t, d, d ′ , m ) + δ c dc ( s, t, d ′ , d, m )) = bc in ( µ ) , X p ∈ [1 ,s ∗ + t ∗ ] ,i ∈ [2 ,n Stree ] ( δ dc ( p, i, d, d ′ , m ) + δ dc ( p, i, d ′ , d, m )) = bc ex ( µ ) ,µ = ( d, d ′ , m ) ∈ Bc , d < d ′ , (76) X i ∈ [1 ,c ∗ ] δ dc ( i, d, d, m ) + X t ∈ [2 ,t ∗ ] δ dc ( t, , d, d, m ) + X s ∈ [1 ,s ∗ ] ,t ∈ [1 ,t ∗ ] δ c dc ( s, t, d, d, m ) = bc in ( µ ) , X p ∈ [1 ,s ∗ + t ∗ ] ,i ∈ [2 ,n Stree ] δ dc ( p, i, d, d, m ) = bc ex ( µ ) ,µ = ( d, d, m ) ∈ Bc . (77)47 Descriptions of New Graph Search Algorithms
C.1 Frequency Vectors of Fictitious Trees
Let T be a chemical bi-rooted or tri-rooted tree, where we regard a rooted tree T as a bi-rootedtree with r ( T ) = r ( T ) for a notational convenience. Recall that our algorithm generates a targetgraph G ∈ G ( x ∗ ) as a supergraph of T , where one of terminals r ( T ) and r ( T ) can be a 2-branchof G . We assume that the second terminal r ( T ) will be a 2-branch of G in such a case in ouralgorithms.For an integer p ∈ [1 , T [+ p ] denote a fictitious chemical graph obtained from T byregarding the degree of terminal r ( T ) as deg T ( r ( T ))+ p . Figure 13 (resp., Figure 14(a)) illustratesfictitious trees T [+ p ] in the case of r ( T ) = r ( T ) (resp., r ( T ) = r ( T )). The frequency vectors fff in ( T [+ p ]) and fff ex ( T [+ p ]) are obtained as follows: Let d = deg T ( r ( T )), v i , i ∈ [1 , d ] denote theneighbors of r ( T ), and d i = deg T ( v i ), m i = β ( r ( T ) v i ), and µ i = ( d, d i , m i ), µ ′ i = ( d + p, d i , m i ), i ∈ [1 , d ].For r ( T ) = r ( T ) and d ′ = d + p , fff in ( T [+ p ]) = fff in ( T ) + 111 dg d ′ − dg d , fff ex ( T [+ p ]) = fff ex ( T ) + X ≤ i ≤ d (111 µ ′ i − µ i ) . For r ( T ) = r ( T ) and d ′ = d + p , where v d denotes the vertex in P T , fff in ( T [+1]) = fff in ( T ) + 111 dg d ′ − dg d + 111 µ ′ d − µ d , fff ex ( T [+1]) = fff ex ( T ) + X ≤ i ≤ d − (111 µ ′ i − µ i ) . (a) T [ +
1] (b) T [ + r = r ( T )= r ( T ) (f) T [ + T [ + T [ + T [ + r r r r r d =1 d =1 d =2 d =2 d =3 d =0 (h) T [ + T [ + r r d =1 d =0 Figure 13: An illustration of fictitious rooted trees T [+ p ], p ∈ [1 ,
3] for rooted trees T with r = r ( T ) = r ( T ) and d = deg T ( r ), where a dashed line depicts a fictitious edge incident to theterminal r ( T ) = r ( T ): (a) T [+1] and d = 1; (b) T [+1] and d = 2; (c) T [+1] and d = 3; (d) T [+2]and d = 0; (e) T [+2] and d = 1; (f) T [+2] and d = 2.; (g) T [+3] and d = 0; (h) T [+3] and d = 1.Let T be a chemical tri-rooted tree, where the third terminal r ( T ) is in the backbone path P T between vertices r ( T ) and r ( T ). Let T h +1 i denote a fictitious chemical graph obtained from T by regarding the degree of terminal r ( T ) as deg T ( r ( T )) + 1. Figure 14(b) illustrate a fictitioustri-rooted tree T h +1 i . The frequency vectors fff in ( T h +1 i ) and fff ex ( T h +1 i ) are obtained as follows:Let d = deg T ( r ( T )), v i , i ∈ [1 , d ] denote the neighbors of r ( T ), where v d and v d +1 are containedin the path P T . For each index i ∈ [1 , d ], let d i = deg T ( v i ), m i = β ( r ( T ) v i ), µ i = ( d, d i , m i ) and48 r ( T ) r ( T ) P T (a) T [ + q ] v v v d q= dr ( T ) r ( T ) P T v v v d r ( T ) v d - (b) T + < > Figure 14: An illustration of fictitious trees T [+ q ] and T h +1 i for bi-rooted tree and tri-rootedtrees T : (a) T [+ q ] of a bi-rooted tree T ; (b) T h +1 i of a tri-rooted tree T . µ ′ i = ( d + 1 , d i , m i ).Then fff in ( T h +1 i ) = fff in ( T ) +111 dg( d +1) − dg d + X i ∈ [ d − ,d ] (111 µ ′ i − µ i ) , fff ex ( T h +1 i ) = fff ex ( T ) + X i ≤ [1 ,d − (111 µ ′ i − µ i ) . C.2 Sets of Frequency Vectors
For an element a ∈ Λ and integers d ∈ [0 , d max −
2] and m ∈ [ d, val( a ) − (0)inl ( a , d, m ) (resp.,W (0)inl+3 ( a , d, m )) denote the set of frequency vectors ( fff in ( T [+2]) , fff ex ( T [+2])) (resp., ( fff in ( T [+3]) , fff ex ( T [+3])))of a chemical rooted tree T such that r ( T ) = r ( T ), the height of T is at most 2, α ( r ( T )) = a , deg T ( r ( T )) = d , and β ( r ( T )) = m .Recall that β ( u ) = P uv ∈ E β ( uv ) defined in Section 2.For an element a ∈ Λ and integers d ∈ [1 , d max − m ∈ [ d, val( a ) −
1] and h ≥
0, letW ( h )end ( a , d, m ) (resp., W ( h )end+2 ( a , d, m )) denote the set of frequency vectors ( fff in ( T [+1]) , fff ex ( T [+1]))(resp., ( fff in ( T [+2]) , fff ex ( T [+2]))) of chemical bi-rooted trees T such that α ( r ( T )) = a , deg T ( r ( T )) = d , β ( r ( T )) = m , ℓ ( P T ) = h andif h = 0 then the height of the tree T ′ rooted at r ( T ) is 2. C.3 Case of Two Leaf 2-branches
C.3.1 Step 1: Enumeration of 2-fringe-trees
The main task of Step 1 is to compute for each tuple ( a , d, m ) of an element a ∈ Λ and integers d ∈ [1 , d max −
1] (resp., d ∈ [0 , d max − m ∈ [ d, val( a ) −
1] (resp., m ∈ [ d, val( a ) − (0)end ( a , d, m ) (resp., W (0)inl ( a , d, m )) of all frequency vectors fff ( T [+1]) (resp., fff ( T [+2])) of chemicalrooted trees T such that r ( T ) = r ( T ), α ( r ( T )) = a , deg T ( r ( T )) = d and β ( r ( T )) = m .Step 1 first computes the set F T of all possible chemical rooted trees T ∈ T ( xxx ∗ ) (where r ( T ) = r ( T )) that can be a 2-fringe-tree of a target graph G ∈ G ( x ∗ ). For this, we design abranch-and-bound procedure where we append a new vertex one by one to construct a rooted tree49ith only one child. To design a bounding procedure, we derive a property of the structure ofchemical rooted trees that can be a 2-fringe-tree of a target graphLet G be a chemical rooted tree with a terminal r = r ( G ) = r ( G ), where fff in ( α ( r ); G ) = 1and fff in ( a ; G ) = 0, a ∈ Λ \ { α ( r ) } and fff in ( γ ; G ) = 0, γ ∈ Γ. For a vector xxx = ( xxx in , xxx ex ) with xxx in , xxx ex ∈ Z Λ ∪ Γ ∪ Bc ∪ Dg+ , we call G xxx -extensible if some chemical acyclic graph G ∈ G ( xxx ) contains G as a subgraph of a 2-fringe-tree T rooted at r in G .We use the next condition as a bounding procedure when we generate chemical rooted trees inStep 1. Lemma 3.
For a branch-parameter k , let xxx ∗ = ( xxx ∗ in , xxx ∗ ex ) be a vector with xxx ∗ in , xxx ∗ ex ∈ Z Λ ∪ Γ ∪ Bc ∪ Dg+ ,and G be a chemical rooted tree rooted at a vertex r such that fff ( G ) ≤ xxx ∗ . (i) Graph G is xxx ∗ -extensible only when the next holds for any subset Λ ′ ⊆ Λ : X a ∈ Λ ′ ( xxx ∗ ex ( a ) − fff ex ( a ; G )) ≤ X γ =( a , b ,k ) ∈ Γ: a ∈ Λ ′ , b ∈ Λ \ Λ ′ ( xxx ∗ ex ( γ ) − fff ex ( γ ; G )) + 2 X γ =( a , b ,k ) ∈ Γ: a , b ∈ Λ ′ ( xxx ∗ ex ( γ ) − fff ex ( γ ; G )) . (78)(ii) Let G denote the chemical rooted tree obtained from G by appending a new atom with anelement b ∈ Λ to an atom with an element a ∈ Λ in G with a multiplicity q ; i.e., we joinan atom a in G and a new atom b with an adjacency-configuration ( a , b , q ) . Then G is xxx ∗ -extensible only when the next holds: xxx ∗ ex ( a ) − fff ex ( a ; G ) ≤ nbnbnb ( a ) − for nbnbnb ( a ) = X γ =( a , b ,k ) ∈ Γ: b = a ∈ Λ ( xxx ∗ ex ( γ ) − fff ex ( γ ; G )) + 2 X γ =( a , a ,k ) ∈ Γ ( xxx ∗ ex ( γ ) − fff ex ( γ ; G )) . Proof. (i) Assume that G is a subgraph of a 2-fringe-tree T in some chemical graph G ∈ G ( xxx ∗ ) sothat T is rooted at r . The left-hand side means the number of the remaining k -external verticeswith elements in Λ ′ in the 2-fringe-trees in G . Each of such atoms has a neighbor in the connectedgraph G . The right-hand side indicates an upper bound on the number of k -external edges joiningelements in Λ ′ in the 2-fringe-trees in G .(ii) Note that fff ex[Λ ∪ Γ] ( G ) = fff ex[Λ ∪ Γ] ( G ) + 111 b + 111 γ . For Λ ′ = { a } , the left-hand side in Eq. (78)is xxx ∗ ex ( a ) − fff ex ( a ; G ), which remains unchanged if a = b (resp., reduces by 1 if a = b ); and theright-hand side in (78) is nbnbnb ( a ), which reduces by 1 if a = b (resp., reduces by 2 if a = b ). That is,the left-hand side minus the right-hand side in (78) always reduces by 1. This gives the requirednecessary condition for G to be xxx ∗ -extensible.Figure 15 illustrates all graph structures of rooted trees T with height at most 2 and onlyone child satisfying the size constraint (1). For each element a ∈ Λ, we enumerate chemical trees T ∈ T ( x ∗ ) rooted a vertex r with α ( r ) = a that has only one child by a branch-and-boundalgorithm. Let T a denote the set of resulting rooted trees for each root element a ∈ Λ.50e next enumerate chemical trees T ∈ T ( x ∗ ) rooted a vertex r with α ( r ) = a that has twoor three children by generating a combination of two or three graphs in T a . During generatinggraphs, our bounding procedure tests whether the current graph satisfies the necessary conditionin Lemma 3(ii).Finally we compute the following sets:for each element a ∈ Λ, integers d ∈ [1 , d max − m ∈ [ d, val( a ) − (0)end ( a , d, m ) offrequency vectors fff ( T [+1]) for rooted trees T ∈ T a with deg T ( r ) = d and height 2;for each element a ∈ Λ, integers d ∈ [0 , d max − m ∈ [ d, val( a ) − (0)inl ( a , d, m ) offrequency vectors fff ( T [+2]) for rooted trees T ∈ T a with deg T ( r ) = d and height at most 2.For each vector ∈ W (0)end ( a , d, m ) (resp., ∈ W (0)inl ( a , d, m )), we store a sample tree T . root root root (a) n ( T )=2 (b) n ( T )=3 (c) n ( T )=4 root (d) n ( T )=5 d =1 d =1 d =1 d =1 Figure 15: An illustration of rooted trees T with height at most 2 and only one child satisfyingthe size constraint: (a) case of n ( T ) = 2; (b) case of n ( T ) = 3; (c) case of n ( T ) = 4; (d) case of n ( T ) = 5. C.3.2 Step 2: Generation of Frequency Vectors of End-subtrees
The main task of Step 2 is to compute the following sets in the ascending order of h = 1 , , . . . , δ :for elements a ∈ Λ, integers d ∈ [1 , d max − m ∈ [ d, val( a ) −
1] and h ∈ [1 , δ ], the sets W ( h )end ( a , d, m )of all frequency vectors fff ( T [+1]) of chemical bi-rooted trees T ∈ T ( x ∗ ) such that α ( r ( T )) = a ,deg T ( r ( T )) = d , β ( r ( T )) = m and ℓ ( P T ) = h .Observe that each vector = ( in ex ) ∈ W ( h )end ( a , d, m ) is obtained from a combination ofvectors ′ = ( ′ in ′ ex ) ∈ W (0)inl ( a , d − , m ′ ) and ′′ = ( ′′ in ′′ ex ) ∈ W ( h − ( b , d ′′ , m ′′ ) such that m ′ ≤ val( a ) − , ≤ m − m ′ ≤ val( b ) − m ′′ in = ′ in + ′′ in + 111 γ + 111 µ ≤ xxx ∗ in ex = ′ ex + ′′ ex ≤ xxx ∗ ex for γ = ( a , b , m − m ′ ) ∈ Γ and µ = ( d + 2 , d ′′ + 1 , m − m ′ ) ∈ Bc . Figure 16 illustrates this process of computing a vector ∈ W ( h )end ( a , d, m ).For each vector ∈ W ( h )end ( a , d, m ) obtained from a combination ′ ∈ W (0)inl ( a , d − , m ′ ) and ′′ ∈ W ( h − ( b , d ′′ , m ′′ ), we construct a sample tree T from their sample trees T ′ and T ′′ . C.3.3 Step 3: Enumeration of Feasible Vector Pairs A feasible pair of vectors is defined to be a pair of vectors i = ( i in i ex ) ∈ W ( δ i )end ( a i , d i , m i ), a i ∈ Λ, d i ∈ [1 , d max − m i ∈ [ d i , val( a i ) − i = 1 , m’’ m-m’ b m’ T’ [ + T’’ [ + h -1 d -1 d’’ Figure 16: An illustration of appending a rooted tree T ′ to a bi-rooted tree T ′′ to compute a vector ∈ W ( h )end ( a , d, m ) from the frequency vectors ′ = fff ( T ′ [+2]) ∈ W (0)inl ( a , d − , m ′ ) of a rooted tree T ′ and ′′ = fff ( T ′′ [+1]) ∈ W ( h − ( b , d ′′ , m ′′ ) of a bi-rooted tree T ′′ . γ = ( a , a , m ) ∈ Γ and a bond-configuration µ = ( d + 1 , d + 1 , m ) ∈ Bc with an integer m ∈ [1 , min { , val( a ) − m , val( a ) − m } ] such that xxx ∗ in = + + 111 γ + 111 µ and xxx ∗ ex = + ,or equivalently is equal to the vector ( xxx ∗ in − − γ − µ , xxx ∗ ex − ), which we call the ( γ, µ ) -complement of , and denote it by .The main task of Step 3 is to enumerate all feasible vector pairs ( ), i ∈ W ( δ i )end ( a i , d i , m i )with a i ∈ Λ, d i ∈ [1 , d max − m i ∈ [ d i , val( a i ) − i = 1 , ( δ i )end ( a i , d i , m i ), i = 1 ,
2, wefirst compute the ( γ, µ )-complement vector of each vector ∈ W ( δ )end ( a , d , m ) for each pair of γ = ( a , a , m ) ∈ Γ and µ = ( d +1 , d +1 , m ) ∈ Bc with m ∈ [1 , min { , val( a ) − m , val( a ) − m } ],and denote by W ( δ )end the set of the resulting ( γ, µ )-complement vectors. Observe that ( ) is afeasible vector pair if and only if = . To find such pairs, we merge the sets W ( δ )end ( a , d , m )and W ( δ )end into a sorted list L γ,µ . Then each consecutive pair of vectors zzz , zzz ∈ L γ,µ gives a feasiblepair of vectors zzz and zzz . C.3.4 Step 4: Construction of Chemical Graphs
The task of Step 4 is to construct for each feasible vector pair i ∈ W ( δ i )end ( a i , d i , m i ), i = 1 , is equal to the ( γ = ( a , a , m ) , µ )-complement vector of , construct a target graph T ( ) ∈ G ( xxx ∗ ) by combining the sample trees T i = T i of vectors i with an edge e = r ( T ) r ( T )such that β ( e ) = m . Figure 8 illustrates two sample trees T i , i = 1 , e = r ( T ) r ( T ). C.4 Case of Three Leaf 2-branches
C.4.1 Step 1: Enumeration of 2-fringe-trees
The main task of Step 1 is to compute the following sets:for each tuple ( a , d, m ) of an element a ∈ Λ and integers d ∈ [1 , d max −
1] (resp., d ∈ [0 , d max − d ∈ [0 , d max − m ∈ [ d, val( a ) −
1] (resp., m ∈ [ d, val( a ) −
2] and m ∈ [ d, val( a ) − (0)end ( a , d, m ) (resp., W (0)inl ( a , d, m ) and W (0)inl+3 ( a , d, m )) of all frequency vectors fff ( T [+1])(resp., fff ( T [+2]) and fff ( T [+3])) of chemical rooted trees T such that r ( T ) = r ( T ), α ( r ( T )) = a ,deg T ( r ( T )) = d and β ( r ( T )) = m . For each vector ∈ W (0)end ( a , d, m ) (resp., ∈ W (0)inl ( a , d, m )and ∈ W (0)inl+3 ( a , d, m )), we store a sample tree T . This step can be designed in a similar wayof Step 1 for the case of bl ( G ) = 2. C.4.2 Step 2: Generation of Frequency Vectors of End-subtrees
Analogously with Step 2 for the case of bl ( G ) = 2, Step 2 computes the following sets in theascending order of h = 1 , , . . . , dia ∗ − − δ :for elements a ∈ Λ, integers d ∈ [1 , d max − m ∈ [ d, val( a ) − i = 1 , h ∈ [1 , dia ∗ − − δ ],the sets W ( h )end ( a , d, m ) of all frequency vectors fff ( T [+1]) of chemical bi-rooted trees T ∈ T ( x ∗ ) suchthat α ( r ( T )) = a , deg T ( r ( T )) = d , β ( r ( T )) = m and ℓ ( P T ) = h .For each vector ∈ W ( h )end ( a , d, m ), we construct a sample tree T from their sample trees T ′ and T ′′ . C.4.3 Step 3: Generation of Frequency Vectors of End-subtrees with Two FictitiousEdges
The main task of Step 3 is to compute the following sets:for elements a ∈ Λ, integers d ∈ [1 , d max − m ∈ [ d, val( a ) −
2] and h ∈ [ ⌈ dia ∗ / ⌉ − , dia ∗ − − δ ],the sets W ( h )end+2 ( a , d, m ) of all frequency vectors of bi-rooted trees T [+2] such that α ( r ( T )) = a ,deg T ( r ( T )) = d , β ( r ( T )) = m and ℓ ( P T ) = h . For each vector ∈ W ( h )end+2 ( a , d, m ), we store asample tree T . This step can be designed in a similar way of Step 3 for the case of bl ( G ) = 2. C.4.4 Step 4: Enumeration of Frequency Vectors of Main-subtrees
For an element a ∈ Λ, and integers d ∈ [2 , d max − m ∈ [ d, val( a ) − δ ∈ [ ⌈ dia ∗ / ⌉ − , dia ∗ − − δ ], define W ( δ +1)main ( a , d, m ) to be the set of the frequency vectors fff ( T h +1 i ) of chemicaltri-rooted trees T such that α ( r ( T )) = a , deg T ( r ( T )) = d , β ( r ( T )) = m , ℓ ( P T ) = dia ∗ − P r ( T ) ,r ( T ) between vertices r ( T ) and r ( T ) is δ + 1.See Figure 9 for the structure of a main-tree. Such a chemical tri-rooted graph T corresponds tothe main-subtree of a target graph G ∈ G ( x ∗ ).The main task of Step 4 is to compute the sets W ( δ +1)main ( a , d, m ), a ∈ Λ, d ∈ [2 , d max − m ∈ [ d, val( a ) − δ ∈ [ ⌈ dia ∗ / ⌉ − , dia ∗ − − δ ]. Each vector ∈ W ( δ +1)main ( a , d, m ) can beobtained from a combination of vectors ∈ W ( δ +1)end+2 ( a , d − , m ′′ ) and ∈ W ( δ )end ( a ′ , d ′ , m ′ ) suchthat δ + δ = dia ∗ − δ ≥ δ , as illustrated in Figure 17. For each vector ∈ W ( δ +1)main ( a , d, m ),we store a sample tree T . This step can be designed in a similar way of Step 3 for the case ofbl ( G ) = 2. 53 [ + length = d length = d a’a d’d -1 m’m ’’ r ( T ) r ( T ) r ( T ) r ( T ) T [ + +1 Figure 17: An illustration of computing the frequency vector = fff ( T h +1 i ) ∈ W ( δ +1)main ( a , d, m )of a tri-rooted tree T from the frequency vectors = fff ( T [+2]) ∈ W ( δ +1)end+2 ( a , d − , m ′′ ) and = fff ( T [+1]) ∈ W ( δ )end ( a ′ , d ′ , m ′ ) for bi-rooted trees T and T . C.4.5 Step 5: Enumeration of Feasible Vector Pairs
Analogously with the case of bl ( G ) = 2, a feasible pair of vectors is defined to be a pair of vectors = ( ) ∈ W ( δ +1)main ( a , d , m ), and = ( ) ∈ W ( δ )end ( a , d , m ), δ ∈ [ ⌈ dia ∗ / ⌉ − , dia ∗ − − δ ], a i ∈ Λ, d i ∈ [1 , d max − m i ∈ [ d i , val( a i ) − i = 1 , γ = ( a , a , m ) ∈ Γ and a bond-configuration µ = ( d + 1 , d + 1 , m ) ∈ Bc with aninteger m ∈ [1 , min { , val( a ) − m , val( a ) − m } ] such that xxx ∗ in = + + 111 γ + 111 µ and xxx ∗ ex = + .Step 5 computes the set all feasible vector pairs ( ) by using a sorting algorithm as in theStep 4 for the case of bl ( G ) = 2. C.4.6 Step 6: Construction of Chemical Graphs
Analogously with Step 4 for the case of bl ( G ) = 2, Step 6 constructs a target graph T ( ) ∈ G ( xxx ∗ )for each feasible vector pair ( ) by combining the sample trees T i = T i of vectors i with anew edge e = r ( T ) r ( T2