Least resolved trees for two-colored best match graphs
David Schaller, Manuela Geiß, Marc Hellmuth, Peter F. Stadler
Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, D-04107 Leipzig, Germany
Software Competence Center Hagenberg GmbH, Softwarepark 21, A-4232 Hagenberg, Austria
Department of Mathematics, Faculty of Science, Stockholm University, SE-106 91 Stockholm, Sweden
Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria
Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá, Colombia
The Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
Abstract
[...] $O(|V| + |E| \log |V|)$-time algorithm to recognize 2-BMGs and to construct its LRT. The approach can be extended to also recognize binary-explainable 2-BMGs with the same complexity. An empirical comparison emphasizes the efficiency of the new algorithm.

Best match graphs have recently been introduced in phylogenetic combinatorics to formalize the notion of a gene $y$ in species 2 being an evolutionarily closest relative of a gene $x$ in species 1, i.e., $y$ is a best match for $x$ [5]. The best matches between genes of two species form a bipartite directed graph, the 2-colored best match graph or 2-BMG, that is determined by the phylogenetic tree describing the evolution of the genes. 2-BMGs are characterized by four local properties [5, 8] that relate them to previously studied classes of digraphs:

Definition 1.
A bipartite digraph $\vec{G} = (L, E)$ is a 2-BMG if it satisfies
(N0) Every vertex has at least one out-neighbor, i.e., $\vec{G}$ is sink-free.
(N1) If $u$ and $v$ are two independent vertices, then there exist no vertices $w$ and $t$ such that $(u,t), (v,w), (t,w) \in E$.
(N2) For any four vertices $u_1, u_2, v_1, v_2$ with $(u_1,v_1), (v_1,u_2), (u_2,v_2) \in E$ we have $(u_1,v_2) \in E$, i.e., $\vec{G}$ is bi-transitive.
(N3) For any two vertices $u$ and $v$ with a common out-neighbor, if there exists no vertex $w$ such that either $(u,w), (w,v) \in E$, or $(v,w), (w,u) \in E$, then $u$ and $v$ have the same in-neighbors and either all out-neighbors of $u$ are also out-neighbors of $v$ or all out-neighbors of $v$ are also out-neighbors of $u$.

Sink-free graphs have appeared in particular in the context of graph semigroups [1] and graph orientation problems [3]. Bi-transitive graphs were introduced in [4] in the context of oriented bipartite graphs and investigated in more detail in [7, 8]. The class of graphs satisfying (N1), (N2), and (N3) is characterized by a system of forbidden induced subgraphs [12], see Thm. 2 below.

In general, best match graphs (BMGs) are defined as vertex-colored digraphs $(\vec{G}, \sigma)$, where the vertex coloring $\sigma$ assigns to each gene $x$ the species $\sigma(x)$ in which it resides. The subgraphs of a BMG induced by vertices of two distinct colors form a 2-BMG. Note that in this context the vertex coloring is assigned a priori, while Def. 1 induces a coloring that is unique only up to relabeling of the colors independently on each (weakly) connected component of $\vec{G}$. For each BMG $(\vec{G}, \sigma)$, there is a unique least resolved leaf-colored tree $(T^*, \sigma)$ with leaves corresponding to the vertices of $(\vec{G}, \sigma)$ such that the arcs in $(\vec{G}, \sigma)$ are the best matches w.r.t. $(T^*, \sigma)$ (cf. Def. 2 below). Fig. 1 shows an example for a 2-BMG together with its least resolved tree. Using certain sets of rooted triples that can be inferred from the 2-colored induced subgraphs of $(\vec{G}, \sigma)$ with three vertices, it is possible to determine whether $(\vec{G}, \sigma)$ is a BMG in polynomial time and, if so, to construct the least resolved tree $(T^*, \sigma)$ [5, 9]. This work also describes $O(|V|)$-time algorithms for the recognition of 2-BMGs and the construction of the LRT for a given 2-BMG.

Figure 1: Example for a 2-BMG $(\vec{G}, \sigma)$ and its explaining least resolved tree $(T^*, \sigma)$.

In this contribution, we derive an alternative characterization of 2-BMGs that avoids the use of rooted triples. This will give rise to an alternative, efficient algorithm for the recognition of 2-BMGs and the construction of the least resolved tree. The contribution is organized as follows: In Sec. 2, we introduce the necessary notation and review some results from the published literature that are needed later on. Sec. 3 is concerned with a more detailed analysis of the least resolved trees (LRTs) of BMGs with an arbitrary number of colors. We then turn to the peculiar properties of the LRTs of 2-BMGs in Sec. 4. To this end, we introduce the concept of "support leaves" that uniquely determine the LRT. The main result of this section is Thm. 4, which shows that the support leaves of the root can be identified directly in the 2-BMG. In Sec. 5, we then turn Thm. 4 into an efficient algorithm for recognizing 2-BMGs and constructing their LRTs. Computational experiments demonstrate the performance gain in practice. In Sec. 6 we extend the algorithmic approach to binary-explainable 2-BMGs, a subclass that features an additional forbidden induced subgraph.

Let $T = (V, E)$ be a tree with root $\rho$ and leaf set $L := L(T) \subset V$. The set of inner vertices of $T$ is $V^0(T) := V \setminus L$; in particular, $\rho$ is an inner vertex. An edge $e = uv \in E(T)$ is called an inner edge of $T$ if $u$ and $v$ are both inner vertices. Otherwise it is called an outer edge. We consider leaf-colored trees $(T, \sigma)$ and write $\sigma(L') := \{\sigma(v) \mid v \in L'\}$ for subsets $L' \subseteq L$. A vertex $u \in V$ is an ancestor of $v \in V$ in $T$, in symbols $v \preceq_T u$, if $u$ lies on the path from $\rho$ to $v$. For the edges $uv \in E(T)$ we use the convention that $v \prec_T u$, i.e., $v$ is a child of $u$. We write $\operatorname{child}_T(u)$ for the set of children of $u$ in $T$ and $T(u)$ for the subtree of $T$ rooted in $u$. The least common ancestor $\operatorname{lca}_T(A)$ is the unique $\preceq_T$-smallest vertex that is an ancestor of all genes in $A$. Writing $\operatorname{lca}_T(x, y) := \operatorname{lca}_T(\{x, y\})$, we have

Definition 2.
Let $(T, \sigma)$ be a leaf-colored tree. A leaf $y \in L(T)$ is a best match of the leaf $x \in L(T)$ if $\sigma(x) \neq \sigma(y)$ and $\operatorname{lca}(x, y) \preceq_T \operatorname{lca}(x, y')$ holds for all leaves $y'$ of color $\sigma(y') = \sigma(y)$.

Given $(T, \sigma)$, the graph $\vec{G}(T, \sigma) = (V, E)$ with vertex set $V = L(T)$, vertex coloring $\sigma$, and with arcs $(x, y) \in E$ if and only if $y$ is a best match of $x$ w.r.t. $(T, \sigma)$ is called the best match graph (BMG) of $(T, \sigma)$ [5].
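As an illustration of Def. 2, the following Python sketch computes the arc set of $\vec{G}(T, \sigma)$ directly from a leaf-colored tree. It is only a naive illustration under stated assumptions: the tree is encoded as nested tuples whose atoms are leaf names, the coloring is a dictionary, and the names `leaves` and `best_match_graph` are hypothetical helpers that do not come from the paper.

```python
def leaves(tree):
    """Return the list of leaves of a tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return [tree]
    return [x for child in tree for x in leaves(child)]


def best_match_graph(tree, color):
    """Arc set of the best match graph of a leaf-colored tree (Def. 2).

    `tree` is a nested tuple (assumed encoding), `color` maps leaves to colors.
    """
    arcs = set()

    def visit(node):
        if not isinstance(node, tuple):
            return
        all_leaves = leaves(node)
        for child in node:
            child_leaves = leaves(child)
            child_colors = {color[x] for x in child_leaves}
            for x in child_leaves:
                for y in all_leaves:
                    # y is a best match of x at this vertex if its color
                    # does not occur in the child subtree containing x,
                    # so that no leaf of that color is closer to x.
                    if color[y] not in child_colors:
                        arcs.add((x, y))
            visit(child)

    visit(tree)
    return arcs


# Example: the tree (x2, (x1, y)) with sigma(x1) = sigma(x2) != sigma(y).
color = {"x1": "A", "x2": "A", "y": "B"}
print(sorted(best_match_graph(("x2", ("x1", "y")), color)))
# prints [('x1', 'y'), ('x2', 'y'), ('y', 'x1')]
```

The sketch recomputes leaf sets at every vertex and is therefore far from optimal; it is meant only to make the definition concrete.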
Definition 3.
An arbitrary vertex-colored graph $(\vec{G}, \sigma)$ is a best match graph (BMG) if there exists a leaf-colored tree $(T, \sigma)$ such that $(\vec{G}, \sigma) = \vec{G}(T, \sigma)$. In this case, we say that $(T, \sigma)$ explains $(\vec{G}, \sigma)$.

Theorem 1 ([5], Thm. 9). If $(\vec{G}, \sigma)$ is a BMG, then there is a unique least-resolved tree $(T, \sigma)$ that explains $(\vec{G}, \sigma)$.

We say that $(\vec{G}, \sigma)$ is an $\ell$-BMG if $\sigma : V(\vec{G}) \to S$ is surjective and $|S| = \ell$. Given a directed graph $\vec{G} = (V, E)$ we denote the set of out-neighbors of a vertex $x \in V$ by $N(x) := \{y \in V \mid (x, y) \in E(\vec{G})\}$ and the out-degree $|N(x)|$ of $x$ by $\operatorname{outdeg}(x)$. Similarly, $N^-(x) := \{y \in V \mid (y, x) \in E(\vec{G})\}$ denotes the set of in-neighbors. By construction, the coloring $\sigma$ of a BMG $(\vec{G}, \sigma)$ is proper, i.e., $x \in N(y)$ implies $\sigma(x) \neq \sigma(y)$, and there is at least one best match of $x$ for every color $s \in \sigma(V) \setminus \{\sigma(x)\}$. In particular, therefore, we have $N(x) \neq \emptyset$ for every vertex $x$ of a 2-BMG, i.e., every 2-BMG is sink-free. Note that BMGs will in general have sources, i.e., $N^-(x)$ may be empty. We write $\vec{G}[W]$ for the subgraph of $\vec{G} = (V, E)$ induced by $W \subseteq V$ and $\vec{G} - W$ for $\vec{G}[V \setminus W]$. A directed graph is (weakly) connected if its underlying undirected graph is connected. A connected component is a maximal connected subgraph of $\vec{G}$.

Following [13] we say that $T'$ is displayed by $T$, in symbols $T' \leq T$, if the tree $T'$ can be obtained from a subtree of $T$ by contraction of edges. For leaf-colored trees we say that $(T, \sigma)$ displays or is a refinement of $(T', \sigma')$, whenever $T' \leq T$ and $\sigma(v) = \sigma'(v)$ for all $v \in L(T')$.

Definition 4.
An edge $e \in E(T)$ is redundant with respect to $\vec{G}(T, \sigma)$ if the tree $T_e$ obtained by contracting the edge $e$ satisfies $\vec{G}(T_e, \sigma) = \vec{G}(T, \sigma)$.

We will need the following characterization of redundant edges:
Lemma 1 ([11], Lemma 2.10). Let $(\vec{G}, \sigma)$ be a BMG explained by a tree $(T, \sigma)$. The edge $e = uv$ in $(T, \sigma)$ is redundant w.r.t. $(\vec{G}, \sigma)$ if and only if (i) $e$ is an inner edge of $T$ and (ii) there is no arc $(a, b) \in E(\vec{G})$ such that $\operatorname{lca}_T(a, b) = v$ and $\sigma(b) \in \sigma(L(T(u)) \setminus L(T(v)))$.

In the following we will frequently need the restriction of the coloring $\sigma$ on $\vec{G}$ or $L(T)$ to a subset of vertices or leaves. Since in situations like $(G_i, \sigma_{|V(G_i)})$ the set to which $\sigma$ is restricted is clear, we will write $\sigma_{|.}$ to keep the notation less cluttered.

BMGs can also be understood in terms of their connected components:

Proposition 1 ([5], Prop. 1). A digraph $(\vec{G}, \sigma)$ is an $\ell$-BMG if and only if all its connected components are $\ell$-BMGs.

As a simple consequence of Prop. 1 and by definition of $\ell$-BMGs, all connected components $(G_i, \sigma_{|.})$ and $(G_j, \sigma_{|.})$ of an $\ell$-BMG satisfy $\sigma(V(G_i)) = \sigma(V(G_j))$ and $|\sigma(V(G_j))| = \ell$. For our purposes it will also be important to relate the structure of a tree $(T, \sigma)$ to the connectedness of the BMG $\vec{G}(T, \sigma)$ that it explains.
Proposition 2 ([5], Thm. 1). Let $(T, \sigma)$ be a leaf-labeled tree and $\vec{G}(T, \sigma)$ its BMG. Then $\vec{G}(T, \sigma)$ is connected if and only if there is a child $v$ of the root $\rho$ such that $\sigma(L(T(v))) \neq \sigma(L(T))$. Furthermore, if $\vec{G}(T, \sigma)$ is not connected, then for every connected component $\vec{G}_i$ of $\vec{G}(T, \sigma)$ there is a child $v$ of the root $\rho$ such that $V(\vec{G}_i) \subseteq L(T(v))$.

Moreover, 2-BMGs can be characterized by three types of forbidden subgraphs [12]. To this end we will need the following classes of small bipartite graphs:
Definition 5 (F1-, F2-, and F3-graphs).
(F1) A properly 2-colored graph on four distinct vertices $V = \{x_1, x_2, y_1, y_2\}$ with coloring $\sigma(x_1) = \sigma(x_2) \neq \sigma(y_1) = \sigma(y_2)$ is an F1-graph if $(x_1, y_1), (y_2, x_2), (y_1, x_2) \in E$ and $(x_1, y_2), (y_2, x_1) \notin E$.
(F2) A properly 2-colored graph on four distinct vertices $V = \{x_1, x_2, y_1, y_2\}$ with coloring $\sigma(x_1) = \sigma(x_2) \neq \sigma(y_1) = \sigma(y_2)$ is an F2-graph if $(x_1, y_1), (y_1, x_2), (x_2, y_2) \in E$ and $(x_1, y_2) \notin E$.
(F3) A properly 2-colored graph on five distinct vertices $V = \{x_1, x_2, y_1, y_2, y_3\}$ with coloring $\sigma(x_1) = \sigma(x_2) \neq \sigma(y_1) = \sigma(y_2) = \sigma(y_3)$ is an F3-graph if $(x_1, y_1), (x_2, y_2), (x_1, y_3), (x_2, y_3) \in E$ and $(x_1, y_2), (x_2, y_1) \notin E$.

Theorem 2 ([12], Thm. 3.4). A properly 2-colored graph is a 2-BMG if and only if it is sink-free and does not contain an induced F1-, F2-, or F3-graph.
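The characterization in Thm. 2 can be turned into a brute-force test, sketched below in Python. This is only an illustration, not the algorithm developed in this contribution: it enumerates all candidate vertex tuples and is therefore slow. The function name `is_2bmg_bruteforce` and the input encoding (vertex list, arc set, color dictionary) are assumptions of the sketch, and so is the placement of the vertex indices in the F1-, F2-, and F3-patterns, which follows the reading F1: $(x_1,y_1),(y_2,x_2),(y_1,x_2) \in E$, $(x_1,y_2),(y_2,x_1) \notin E$; F2: $(x_1,y_1),(y_1,x_2),(x_2,y_2) \in E$, $(x_1,y_2) \notin E$; F3: $(x_1,y_1),(x_2,y_2),(x_1,y_3),(x_2,y_3) \in E$, $(x_1,y_2),(x_2,y_1) \notin E$. The input is assumed to be properly colored with exactly two colors.

```python
from itertools import permutations


def is_2bmg_bruteforce(vertices, arcs, color):
    """Brute-force test of Thm. 2 for a properly 2-colored digraph."""
    arcs = set(arcs)
    # Sink-freeness: every vertex needs at least one out-neighbor.
    tails = {u for (u, _) in arcs}
    if any(v not in tails for v in vertices):
        return False
    # Split the vertices into the two color classes.
    classes = {}
    for v in vertices:
        classes.setdefault(color[v], []).append(v)
    if len(classes) != 2:
        raise ValueError("expected exactly two colors")
    first, second = classes.values()
    E = arcs.__contains__
    # Each color class may play the role of the x-vertices.
    for X, Y in ((first, second), (second, first)):
        for x1, x2 in permutations(X, 2):
            for y1, y2 in permutations(Y, 2):
                f1 = (E((x1, y1)) and E((y2, x2)) and E((y1, x2))
                      and not E((x1, y2)) and not E((y2, x1)))
                f2 = (E((x1, y1)) and E((y1, x2)) and E((x2, y2))
                      and not E((x1, y2)))
                if f1 or f2:
                    return False
            for y1, y2, y3 in permutations(Y, 3):
                f3 = (E((x1, y1)) and E((x2, y2))
                      and E((x1, y3)) and E((x2, y3))
                      and not E((x1, y2)) and not E((x2, y1)))
                if f3:
                    return False
    return True
```

For instance, the digraph with arcs $(x_1,y_1), (y_1,x_2), (x_2,y_2), (y_2,x_2)$ is sink-free but contains an F2-graph under the reading above, so the sketch rejects it.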
As noted in [12], the forbidden induced F1-, F2-, and F3-subgraphs characterize exactly the class of bipartite directed graphs satisfying the Axioms (N1), (N2), and (N3) mentioned in the introduction.

Although we aim at avoiding the use of triples in the final results, we will need them during our discussion. A triple $ab|c$ is a rooted tree $t$ on three pairwise distinct vertices $\{a, b, c\}$ such that $\operatorname{lca}_t(a, b) \prec_t \operatorname{lca}_t(a, c) = \operatorname{lca}_t(b, c) = \rho$, where $\rho$ denotes the root of $t$. A set $R$ of triples is consistent if there is a tree $T$ that displays all triples in $R$. Given a vertex-colored graph $(\vec{G}, \sigma)$, we define its set of informative triples [5, 11] as

$$R(\vec{G}, \sigma) := \left\{ ab|b' \;:\; \sigma(a) \neq \sigma(b) = \sigma(b'),\ (a, b) \in E(\vec{G}),\ (a, b') \notin E(\vec{G}) \right\}. \tag{1}$$

Lemma 2 ([11], Lemma 2.8 and 2.9). If $(\vec{G}, \sigma)$ is a BMG, then every tree $(T, \sigma)$ that explains $(\vec{G}, \sigma)$ displays all triples $t \in R(\vec{G}, \sigma)$. Moreover, if the triples $ab|b'$ and $cb'|b$ are informative for $(\vec{G}, \sigma)$, then every tree $(T, \sigma)$ that explains $(\vec{G}, \sigma)$ contains two distinct children $v_1, v_2 \in \operatorname{child}_T(\operatorname{lca}_T(a, c))$ such that $a, b \prec_T v_1$ and $b', c \prec_T v_2$.

Observation 3.
Let $(T, \sigma)$ be a tree explaining the BMG $(\vec{G}, \sigma)$, and $v \in V(T)$ a vertex such that $\sigma(L(T(v))) = \sigma(L(T))$. Then $(a, b) \in E(\vec{G})$ and $a \in L(T(v))$ implies $b \in L(T(v))$.

Finally, there is a close connection between subtrees of $T$ and subgraphs of $\vec{G}(T, \sigma)$. We have
Lemma 3 ([9], Lemma 22 and 23). Let $(T, \sigma)$ be a tree explaining a BMG $(\vec{G}, \sigma)$. Then $\vec{G}(T(u), \sigma_{|.}) = (\vec{G}[L(T(u))], \sigma_{|.})$ holds for every $u \in V(T)$. Moreover, if $(T, \sigma)$ is least resolved for $(\vec{G}, \sigma)$, then the subtree $T(u)$ is least resolved for $\vec{G}(T(u), \sigma_{|.})$.

In this short section we derive some helpful properties of LRTs which we will use repeatedly throughout this work.
Lemma 4.
Let $(\vec{G}, \sigma)$ be a BMG and $(T, \sigma)$ its least resolved tree. Then the BMG $\vec{G}(T(v), \sigma_{|.})$ is connected for every $v \in V(T)$ with $v \prec_T \rho_T$.

Proof. By Lemma 3, $\vec{G}(T(v), \sigma_{|.})$ is a BMG. First observe that the BMG $\vec{G}(T(v), \sigma_{|.})$ is trivially connected if $v$ is a leaf. Now let $v \prec_T \rho_T$ be an arbitrary inner vertex of $T$. Thus, there exists a vertex $u \succ_T v$ such that $uv$ is an inner edge. Since $(T, \sigma)$ is least resolved, it does not contain any redundant edges. Hence, by contraposition of Lemma 1, there is an arc $(a, b) \in E(\vec{G})$ such that $\operatorname{lca}_T(a, b) = v$ and $\sigma(b) \in \sigma(L(T(u)) \setminus L(T(v)))$. Since $a, b \in L(T(v))$, Lemma 3 implies that $(a, b)$ is also an arc in $\vec{G}(T(v), \sigma_{|.})$. Moreover, $\operatorname{lca}_{T(v)}(a, b) = v$ clearly also holds in the subtree rooted at $v$. Now consider the child $w \in \operatorname{child}_{T(v)}(v)$ such that $a \preceq_{T(v)} w$. There cannot be a leaf $b' \in L(T(w))$ with $\sigma(b') = \sigma(b)$ since otherwise $\operatorname{lca}_{T(v)}(a, b') \preceq_{T(v)} w \prec_{T(v)} v$ would contradict that $(a, b)$ is an arc in $\vec{G}(T(v), \sigma_{|.})$. Thus $\sigma(b) \notin \sigma(L(T(w)))$. Since $\sigma(b) \in \sigma(L(T(v)))$, we thus conclude $\sigma(L(T(w))) \neq \sigma(L(T(v)))$. The latter together with Prop. 2 implies that $\vec{G}(T(v), \sigma_{|.})$ is connected.

The converse of Lemma 4, however, is not true, i.e., a tree $(T, \sigma)$ for which $\vec{G}(T(v), \sigma_{|.})$ is connected for every $v \in V(T)$ with $v \prec_T \rho_T$ is not necessarily least resolved. To see this, consider the caterpillar tree $(T, \sigma)$ given by $(x'', (x', (x, y)))$ with $\sigma(x) = \sigma(x') = \sigma(x'') \neq \sigma(y)$ and $u = \operatorname{lca}_T(x, x')$. It is an easy task to verify that the BMG of each subtree of $T$ is connected. However, the edge $\rho_T u$ is redundant.

Lemma 5.
Let $(T, \sigma)$ be the least resolved tree of some BMG $(\vec{G}, \sigma)$. Then every vertex $v \prec_T \rho_T$ with $|\sigma(L(T(v)))| = 1$ is a leaf.

Proof. Let $v \prec_T \rho_T$ with $|\sigma(L(T(v)))| = 1$ and assume, for contradiction, that $v$ is not a leaf. Hence, $|L(T(v))| > 1$. By Lemma 3, $\vec{G}(T(v), \sigma_{|.})$ is a BMG and, therefore, properly colored. But then $\vec{G}(T(v), \sigma_{|.})$ is disconnected; a contradiction to Lemma 4.

As a consequence we find

Corollary 1.
Let $(T, \sigma)$ be the least resolved tree of some BMG $(\vec{G}, \sigma)$. Then any vertex $v \in V(T)$ with $v \prec_T \rho_T$ is an inner vertex if and only if $|\sigma(L(T(v)))| > 1$.

Proof. If $|\sigma(L(T(v)))| = 1$, Lemma 5 implies that $v$ is a leaf. Otherwise, if $|\sigma(L(T(v)))| > 1$, $T(v)$ clearly must contain at least two leaves and thus $v$ cannot be a leaf.

Support Leaves
In this section we introduce "support leaves" as a means to recursively construct the LRT of a 2-BMG. The main result of this section shows that these leaves can be inferred directly from the BMG without any further knowledge of the corresponding LRT. We start with a technical result similar to Cor. 3 in [5]; here we use a much simpler, more convenient notation.
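The recursive construction announced here can be previewed with a short Python sketch. It anticipates notions that are only introduced later in this section (the umbrella vertices $U$, the sets $S^{(1)}$ and $S^{(2)}$, and Alg. 1) and follows their definitions naively; it makes no attempt to reach the running time stated in the abstract. The names `weak_components` and `build_2col_lrt`, the nested-tuple encoding of the resulting tree, and the use of `None` in place of "exit false" are assumptions of this sketch. The input is assumed to be a connected, properly 2-colored digraph.

```python
from collections import Counter


def weak_components(vertices, arcs):
    """Weakly connected components of the digraph induced on `vertices`."""
    adj = {v: set() for v in vertices}
    for x, y in arcs:
        if x in adj and y in adj:
            adj[x].add(y)
            adj[y].add(x)
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(comp)
    return comps


def build_2col_lrt(vertices, arcs, color):
    """Return the tree as nested tuples, or None if a check fails."""
    vertices = set(vertices)
    arcs = {(x, y) for (x, y) in arcs if x in vertices and y in vertices}
    out_nb = {v: set() for v in vertices}
    in_nb = {v: set() for v in vertices}
    for x, y in arcs:
        out_nb[x].add(y)
        in_nb[y].add(x)
    per_color = Counter(color[v] for v in vertices)
    # Umbrella vertices: out-arcs to all vertices of the other color.
    umbrella = {x for x in vertices
                if len(out_nb[x]) == len(vertices) - per_color[color[x]]}
    s1 = {x for x in umbrella if in_nb[x] <= umbrella}
    s2 = {x for x in s1 if in_nb[x] <= s1}
    if not s1 or s1 != s2:
        return None
    children = sorted(s2)  # support leaves become children of the root
    for comp in weak_components(vertices - s2, arcs):
        if len(comp) == 1:
            return None
        subtree = build_2col_lrt(comp, arcs, color)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)
```

For the 2-BMG with arcs $(x_1, y), (y, x_1), (x_2, y)$ and $\sigma(x_1) = \sigma(x_2) \neq \sigma(y)$, the sketch first attaches $x_2$ to the root and then recurses on the component induced by $\{x_1, y\}$.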
Lemma 6.
Let $(T, \sigma)$ be the least resolved tree of a 2-colored BMG $(\vec{G}, \sigma)$. Then, for every vertex $u \in V^0(T) \setminus \{\rho_T\}$, it holds $\operatorname{child}_T(u) \cap L(T) \neq \emptyset$. If $(\vec{G}, \sigma)$ is connected, then $\operatorname{child}_T(u) \cap L(T) \neq \emptyset$ holds for every $u \in V^0(T)$.

Proof. Suppose first that $(\vec{G}, \sigma)$ is disconnected and let $u \in V^0(T) \setminus \{\rho_T\}$. Since $(T, \sigma)$ is least resolved, Lemma 4 implies that $\vec{G}(T(u), \sigma_{|.})$ is connected for every $u \in V(T)$ with $u \prec_T \rho_T$. Hence, we can apply Prop. 2 to $\vec{G}(T(u), \sigma_{|.})$ and conclude that there is a child $v \in \operatorname{child}_{T(u)}(u)$ such that $\sigma(L(T(v))) \neq \sigma(L(T(u)))$, hence in particular $\sigma(L(T(v))) \subsetneq \sigma(L(T(u)))$. Since $(T, \sigma)$ is 2-colored, the latter immediately implies $|\sigma(L(T(v)))| = 1$ and, by Cor. 1, $v$ is a leaf. Thus every $u \in V^0(T) \setminus \{\rho_T\}$ has a leaf $v$ among its children, i.e., $\operatorname{child}_T(u) \cap L(T) \neq \emptyset$. If in addition $(\vec{G}, \sigma)$ is connected, we can apply the same argumentation to $u = \rho_T$ and conclude that a leaf $v$ is attached to $\rho_T$.

Lemma 6 states that, in the least resolved tree of a connected 2-colored BMG, every inner vertex $u$ is adjacent to at least one leaf, and thus in a way "supported" by it.

Definition 6 (Support Leaves). For a given tree $T$, the set $S_u := \operatorname{child}_T(u) \cap L(T)$ is the set of all support leaves of vertex $u \in V(T)$.

Note that Lemma 6 is in general not true for $\ell$-BMGs with $\ell \geq 3$, as exemplified by the (least-resolved) tree $((a, b), (c, a'))$ with three distinct leaf colors $\sigma(a) = \sigma(a') \neq \sigma(b) \neq \sigma(c)$.

As a simple consequence of Prop. 2 and Cor. 1, we find

Corollary 2.
Let $(T, \sigma)$ be the least resolved tree (with root $\rho$) of some 2-colored BMG $\vec{G}(T, \sigma)$. Then, $\vec{G}(T, \sigma)$ is connected if and only if $S_\rho \neq \emptyset$.

Proof. By Prop. 2, $\vec{G}(T, \sigma)$ is connected if and only if there exists a child $v$ of the root $\rho$ of $T$, $v \in \operatorname{child}_T(\rho)$, such that $T(v)$ does not contain all colors. Thus $|\sigma(L(T(v)))| = 1$. By Cor. 1, we have $|\sigma(L(T(v)))| = 1$ if and only if $v$ is a leaf, i.e., $v \in S_\rho$. Hence, $\vec{G}(T, \sigma)$ is connected if and only if $S_\rho \neq \emptyset$.

Lemma 7.
Let $(T, \sigma)$ be the least resolved tree of a 2-BMG $(\vec{G}, \sigma)$, and $S_\rho$ the set of support leaves of the root $\rho$. Then the connected components of $(\vec{G} - S_\rho, \sigma_{|.})$ are exactly the BMGs $\vec{G}(T(v), \sigma_{|.})$ with $v \in \operatorname{child}(\rho) \setminus S_\rho$.

Proof. Let $v \in \operatorname{child}_T(\rho) \cap V^0(T) = \operatorname{child}_T(\rho) \setminus S_\rho$ and consider the BMG $\vec{G}(T(v), \sigma_{|.})$. By Lemma 4 and Lemma 3, $\vec{G}(T(v), \sigma_{|.})$ is connected and we have $\vec{G}(T(v), \sigma_{|.}) = (\vec{G}[L(T(v))], \sigma_{|.})$. Moreover, it holds $((\vec{G} - S_\rho)[L(T(v))], \sigma_{|.}) = (\vec{G}[L(T(v))], \sigma_{|.})$ since $L(T(v)) = V(\vec{G}[L(T(v))]) = V(H[L(T(v))])$ for $H := \vec{G} - S_\rho = \vec{G}[V(\vec{G}) \setminus S_\rho]$.

If $\operatorname{child}_T(\rho) \setminus S_\rho = \{v\}$, then the statement is trivially satisfied. Therefore, suppose that $|\operatorname{child}_T(\rho) \setminus S_\rho| > 1$. Hence, it remains to show that there are no arcs between $\vec{G}(T(v), \sigma_{|.})$ and $\vec{G}(T(w), \sigma_{|.})$ for any $w \in \operatorname{child}_T(\rho) \setminus S_\rho$, $w \neq v$. Cor. 1 and $v \prec_T \rho$ imply that $T(v)$ contains both colors. Thus, by Obs. 3, there are no out-arcs to any vertex in $L(T) \setminus L(T(v))$, hence in particular there are no out-arcs $(x, y)$ with $x \preceq_T v$, $y \preceq_T w$. By symmetry, the same holds for $w$, thus we can conclude that there are no arcs $(y, x)$. From the observation that each $x \in L(T) \setminus S_\rho$ must be located below some $v \in \operatorname{child}_T(\rho) \cap V^0(T)$, it now immediately follows that $(\vec{G} - S_\rho, \sigma_{|.})$ consists exactly of these connected components as stated.

As a consequence, we have

Corollary 3.
Let $(T, \sigma)$ with root $\rho$ be the LRT of a 2-BMG $(\vec{G}, \sigma)$. Then each child of $\rho$ is either one of the support leaves $S_\rho$ of $\rho$ or the root of the LRT for a connected component of $(\vec{G} - S_\rho, \sigma_{|.})$.

Proof. Let $(T, \sigma)$ with root $\rho$ be the least resolved tree for $(\vec{G}, \sigma)$. The support leaves $S_\rho$ are children of $\rho$ by definition. By Lemma 7, the connected components of $(\vec{G} - S_\rho, \sigma_{|.})$ are exactly the BMGs $\vec{G}(T(v), \sigma_{|.})$ with $v \in \operatorname{child}_T(\rho) \setminus S_\rho$. Moreover, by Lemma 3, the subtrees $T(v)$ with $v \in \operatorname{child}_T(\rho) \setminus S_\rho$ are exactly the unique LRTs for these BMGs.

In order to use this property as a means of constructing the LRT in a recursive manner, we need to identify the support leaves of the root $S_\rho$ directly from the 2-BMG $(\vec{G}, \sigma)$ without constructing the LRT first. To this end, we consider the set of umbrella vertices $U(\vec{G}, \sigma)$ comprising all vertices $x$ for which $N(x)$ consists of all vertices of $V(\vec{G})$ that have the color distinct from $\sigma(x)$.

Definition 7 (Umbrella Vertices). For an arbitrary 2-colored graph $(\vec{G}, \sigma)$, the set
$$U(\vec{G}, \sigma) := \left\{ x \in V(\vec{G}) \;\middle|\; y \in N(x) \text{ if } \sigma(y) \neq \sigma(x) \text{ and } y \in V(\vec{G}) \right\}$$
is the set of umbrella vertices of $(\vec{G}, \sigma)$.

The intuition behind this definition is that every support leaf of the root of the LRT of a 2-BMG must have all differently colored vertices as out-neighbors, i.e., they are umbrella vertices. We now define "support sets" of graphs as particular subsets of umbrella vertices. As we shall see later, support sets are closely related to support vertices in $S_\rho$.

Definition 8 (Support Set of $(\vec{G}, \sigma)$). Let $(\vec{G}, \sigma)$ be a 2-colored graph. A support set $S := S(\vec{G}, \sigma)$ of $(\vec{G}, \sigma)$ is a maximal subset $S \subseteq U(\vec{G}, \sigma)$ of umbrella vertices such that $x \in S$ implies $N^-(x) \subseteq S$.

Lemma 8.
Every 2-colored graph $(\vec{G}, \sigma)$ has a unique support set $S(\vec{G}, \sigma)$.

Proof. Assume, for contradiction, that $(\vec{G}, \sigma)$ has (at least) two distinct support sets $S, S' \subseteq U(\vec{G}, \sigma)$. Clearly neither of them can be a subset of the other, since support sets are maximal. We have $N^-(x) \subseteq S$ for all $x \in S$ and $N^-(x') \subseteq S'$ for all $x' \in S'$, which implies that $N^-(z) \subseteq S \cup S'$ for all $z \in S \cup S'$. Together with the fact that $S$, $S'$, and thus $S \cup S'$, are all subsets of $U(\vec{G}, \sigma)$, this contradicts the maximality of both $S$ and $S'$.

For the construction of the support set $S := S(\vec{G}, \sigma)$, we consider the following sequence of sets, defined recursively by

$$S^{(k)} := \left\{ x \in S^{(k-1)} \;\middle|\; N^-(x) \subseteq S^{(k-1)} \right\} \text{ for } k \geq 1, \qquad S^{(0)} = U(\vec{G}, \sigma). \tag{2}$$

By construction $S^{(k+1)} \subseteq S^{(k)}$. Furthermore, there is a $k < |V(\vec{G})|$ such that $S^{(k+1)} = S^{(k)}$. Next we show that in a 2-BMG, $S$ is obtained in a single iteration.

Lemma 9. If $(\vec{G}, \sigma)$ is a 2-BMG, then $S = S^{(1)}$.

Proof. Let $(\vec{G} = (V, E), \sigma)$ be a 2-BMG and $U = U(\vec{G}, \sigma)$. Assume for contradiction that $S \neq S^{(1)}$, and thus $S^{(2)} \subsetneq S^{(1)}$. We will show that this implies the existence of a forbidden F2-graph. By assumption, there is a vertex $x_1 \in S^{(1)} \setminus S^{(2)}$. Thus, there must be a vertex $y_1 \in N^-(x_1)$ (and thus $(y_1, x_1) \in E$) with $\sigma(y_1) \neq \sigma(x_1)$ such that $y_1 \notin S^{(1)}$. However, by definition, $y_1 \in N^-(x_1)$ and $x_1 \in S^{(1)}$ implies $y_1 \in U$. Now, it follows from $y_1 \in U \setminus S^{(1)}$ that there is a vertex $x_2 \in N^-(y_1)$ with $\sigma(x_2) = \sigma(x_1) \neq \sigma(y_1)$ such that $x_2 \notin U$. The latter together with $x_1 \in S^{(1)} \subseteq U$ implies $x_1 \neq x_2$. In particular, since $x_2 \notin U$, the vertex $x_2$ does not have an out-arc to every differently colored vertex, thus there must be a vertex $y_2$ with $\sigma(y_2) = \sigma(y_1)$ such that $(x_2, y_2) \notin E$. Since $x_2 \in N^-(y_1)$, we have $(x_2, y_1) \in E$ and $y_1 \neq y_2$. Finally, $x_1 \in U$ and $\sigma(y_2) = \sigma(y_1) \neq \sigma(x_1)$ implies that $(x_1, y_2) \in E$. In summary, we have four distinct vertices $x_1, x_2, y_1, y_2$ with $\sigma(x_1) = \sigma(x_2) \neq \sigma(y_1) = \sigma(y_2)$ and (non-)arcs $(x_2, y_1), (y_1, x_1), (x_1, y_2) \in E$ and $(x_2, y_2) \notin E$, and hence an induced F2-graph in $(\vec{G}, \sigma)$. By Thm. 2, we can conclude that $(\vec{G}, \sigma)$ is not a BMG; a contradiction.

In general, $S = S^{(0)} = U(\vec{G}, \sigma)$ is not satisfied. To see this consider the BMG $(\vec{G}, \sigma)$ that is explained by the triple $x_1 y | x_2$ with $\sigma(x_1) = \sigma(x_2) \neq \sigma(y)$. One easily verifies that $U(\vec{G}, \sigma) = \{x_1, x_2\}$ but $S = \{x_2\}$.

Theorem 4.
Let $(T, \sigma)$ be the least resolved tree of a 2-BMG $(\vec{G}, \sigma)$. Then, the set of support leaves $S_\rho$ of the root $\rho$ equals the support set $S$ of $(\vec{G}, \sigma)$. In particular, $S \neq \emptyset$ if and only if $(\vec{G}, \sigma)$ is connected.

Proof. Let $(T, \sigma)$ be the LRT of a 2-BMG $(\vec{G} = (V, E), \sigma)$. We set $U := U(\vec{G}, \sigma)$ and note first that $S = S^{(1)}$ by Lemma 9.

First, suppose that $(\vec{G}, \sigma)$ is not connected. Then it immediately follows from Prop. 2 that $\sigma(L(T(v))) = \sigma(L(T))$ and thus $|\sigma(L(T(v)))| > 1$ for all $v \in \operatorname{child}_T(\rho)$. The latter together with Cor. 1 implies that any child of $\rho$ must be an inner vertex in $T$. Hence, $S_\rho = \emptyset$. On the other hand, since $(\vec{G}, \sigma)$ is not connected, each of its connected components is a 2-BMG (cf. Prop. 1), and thus contains both colors. Therefore, for each vertex $x$ in $\vec{G}$, we can find a vertex $y$ with $\sigma(x) \neq \sigma(y)$ such that $(x, y), (y, x) \notin E$, and thus $x \notin S$. Since this is true for any vertex in $\vec{G}$, we can conclude $S = \emptyset = S_\rho$.

Now, suppose that $(\vec{G}, \sigma)$ is connected. By Cor. 2, we have $S_\rho \neq \emptyset$. We first show $S_\rho \subseteq S$. Let $x \in S_\rho$. By definition, $x$ satisfies $\operatorname{lca}_T(x, y) = \rho$ and therefore $(x, y) \in E$ for all $y \in L(T)$ with $\sigma(y) \neq \sigma(x)$, i.e., $x$ has an out-arc to every differently colored vertex in $\vec{G}$. By definition, we thus have $x \in U$. Now assume for contradiction that $x \notin S = S^{(1)} = \{z \in U \mid N^-(z) \subseteq U\}$. The latter implies that there exists a vertex $y \in N^-(x)$ such that $y \notin U$. In particular, $(y, x) \in E$. Since $y \notin U$, there is some vertex $x'$ with $\sigma(x') = \sigma(x)$ such that $(y, x') \notin E$. Together this implies that $xy|x'$ is an informative triple. By Lemma 2, we obtain $\operatorname{lca}_T(x, y) \prec_T \operatorname{lca}_T(x, x') = \operatorname{lca}_T(x', y) \preceq_T \rho$; a contradiction to the assumption that $x$ is a support leaf of $\rho$. Thus $x \in S$.

Next, we show by contraposition that $S \subseteq S_\rho$. To this end, suppose that $x$ is not a support leaf of $\rho$, i.e., $x \notin S_\rho$. Hence, there is an inner vertex $v \in \operatorname{child}_T(\rho) \cap V^0(T)$ such that $x \prec_T v$. By Cor. 1, we conclude that $|\sigma(L(T(v)))| = 2$, i.e., the subtree $T(v)$ contains both colors. We now distinguish two cases: (i) there is a leaf $y' \in L(T) \setminus L(T(v))$ with $\sigma(y') \neq \sigma(x)$, and (ii) there is no leaf $y' \in L(T) \setminus L(T(v))$ with $\sigma(y') \neq \sigma(x)$.

Case (i): Since $T(v)$ contains both colors, there is a leaf $y \in L(T(v))$, with $y \neq y'$ and $\sigma(y) = \sigma(y') \neq \sigma(x)$. Since, by construction, we have $\operatorname{lca}_T(x, y) \preceq_T v \prec_T \rho = \operatorname{lca}_T(x, y')$, it follows $(x, y') \notin E$. Together with $\sigma(x) \neq \sigma(y')$, this immediately implies $x \notin U$. From $S^{(2)} \subseteq S^{(1)} \subseteq U$, we conclude $x \notin S^{(1)} = S$.

Case (ii): Suppose that there is no leaf $y' \in L(T) \setminus L(T(v))$ with $\sigma(y') \neq \sigma(x)$. We will continue by showing that there is a support leaf $y$ of vertex $v$ with $\sigma(y) \neq \sigma(x)$. Assume, for contradiction, that the latter is not the case. Since $(T, \sigma)$ is least resolved, the inner edge $\rho v$ is not redundant. Hence, by Lemma 1, there must be an arc $(a, b) \in E$ such that $\operatorname{lca}_T(a, b) = v$ and $\sigma(b) \in \sigma(L(T) \setminus L(T(v)))$. Since there is no leaf $y' \in L(T) \setminus L(T(v))$ with $\sigma(y') \neq \sigma(x)$, we conclude that $\sigma(b) = \sigma(x)$ and $\sigma(a) \neq \sigma(x)$. Clearly, it holds $a, b \in L(T(v))$. Now consider an arbitrary $a' \in L(T(v))$ with $\sigma(a') \neq \sigma(x)$. Since, by assumption, every such $a'$ is not a support leaf of $v$, there must be an inner vertex $w \in \operatorname{child}_{T(v)}(v)$ with $a' \prec_T w$. By Cor. 1 and since $w \prec_T v \prec_T \rho$, we conclude that $|\sigma(L(T(w)))| = 2$, i.e., the subtree $T(w)$ contains both colors. Thus there is some $b'$ with $\sigma(b') = \sigma(x)$ and $\operatorname{lca}_T(a', b') \preceq_T w \prec_T v$. Since $a'$ was chosen arbitrarily, we conclude that there cannot be an arc $(a, b) \in E$ such that $\operatorname{lca}_T(a, b) = v$; a contradiction. It follows that there is a support leaf $y$ of vertex $v$ with $\sigma(y) \neq \sigma(x)$. Hence, $\operatorname{lca}_T(x, y) = v \preceq_T \operatorname{lca}_T(x'', y)$ for all $x'' \in L(T)$ with $\sigma(x'') = \sigma(x)$, and thus $(y, x) \in E$ and $y \in N^-(x)$. Since $S_\rho \neq \emptyset$ and $\sigma(y) \notin \sigma(L(T) \setminus L(T(v)))$, there must be a leaf $x' \in S_\rho$ with $\sigma(x') = \sigma(x)$. The fact that $\operatorname{lca}_T(x, y) = v \prec_T \rho = \operatorname{lca}_T(x', y)$ implies $(y, x') \notin E$. Therefore and since $\sigma(x') \neq \sigma(y)$, it follows $y \notin U$.
Together with $y \in N^-(x)$, we conclude that $x \notin S^{(1)} = S$.

In summary, we have shown $S = S_\rho$ for any 2-BMG $(\vec{G}, \sigma)$. Finally, $S = S_\rho$ together with Cor. 2 implies that $S \neq \emptyset$ if and only if $(\vec{G}, \sigma)$ is connected, which completes the proof.

Thm. 4 provides not only a convenient necessary condition for connected 2-BMGs but also a fast way of determining the support set $S = S_\rho$ and thus also a fast recursive approach to construct the LRT for a 2-BMG. It is formalized in Alg. 1 and illustrated in Fig. 2.

Lemma 10.
Let $(\vec{G}, \sigma)$ be a connected 2-BMG. Then Alg. 1 returns the least resolved tree for $(\vec{G}, \sigma)$.

Proof. Let $(T, \sigma)$ be the (unique) least resolved tree of $(\vec{G}, \sigma)$ with root $\rho$. The latter is supplied to Alg. 1 to initialize the tree. By Thm. 4, Lemma 9 and since $(\vec{G}, \sigma)$ is connected, the set of support leaves $S_\rho = S^{(2)} = S^{(1)} \neq \emptyset$ for the root $\rho$ is correctly identified in the top-level recursion of Alg. 1 (Line 2-4) and attached to the root $\rho$ (Line 8-9). According to Cor. 3, one can now proceed to recursively construct the LRTs for the connected components of $(\vec{G} - S_\rho, \sigma_{|.})$, which is done in Line 10-15. By Lemma 7, these connected components $(\vec{G}_v, \sigma_{|.})$ are exactly the BMGs $\vec{G}(T(v), \sigma_{|.})$ with $v \in \operatorname{child}_T(\rho) \setminus S_\rho$ (Line 14). In particular, therefore, we have $V(\vec{G}_v) = L(T(v))$. Since $v \notin S_\rho$, i.e., $v$ is an inner vertex, Cor. 1 and $v \prec_T \rho$ imply $|\sigma(L(T(v)))| > 1$. Hence, in particular, the condition $|V(\vec{G}_v)| > 1$ [...]

Algorithm 1: LRT for connected 2-colored BMGs $(\vec{G}, \sigma)$.
Input: Connected properly 2-colored digraph $(\vec{G} = (L, E), \sigma)$, vertex $\rho$
Output: LRT of $(\vec{G}, \sigma)$ if $(\vec{G}, \sigma)$ is a BMG
 1  Function Build2ColLRT($\vec{G}$, $\sigma$, $\rho$)
 2      $U \leftarrow \{x \in L \mid \operatorname{outdeg}(x) = |L| - |L[\sigma(x)]|\}$    // umbrella vertices
 3      $S^{(1)} \leftarrow \{x \in U \mid N^-(x) \subseteq U\}$    // all in-neighbors in $U$
 4      $S^{(2)} \leftarrow \{x \in S^{(1)} \mid N^-(x) \subseteq S^{(1)}\}$    // all in-neighbors in $S^{(1)}$
 5      if $S^{(1)} = \emptyset$ or $S^{(2)} \neq S^{(1)}$ then
 6          exit false
 7      else
 8          foreach $x \in S^{(2)}$ do
 9              add $x$ as a child of $\rho$
10          foreach connected component $\vec{G}_v$ of $\vec{G} - S^{(2)}$ do
11              if $|V(\vec{G}_v)| = 1$ then
12                  exit false
13              create vertex $v$
14              $T_v \leftarrow$ Build2ColLRT($\vec{G}_v$, $\sigma_{|.}$, $v$)
15              connect the root $v$ of $T_v$ as a child to $\rho$

Figure 2: Illustration of Alg. 1 with input $(\vec{G}, \sigma)$ (uppermost box). The boxes indicate the five recursion steps that are necessary to decompose $(\vec{G}, \sigma)$, and correspond to the five inner vertices of the LRT shown on the right. Note that, in one recursion step (on the subgraph induced by two vertices labeled $a$ and one labeled $b$), we have $U \neq S^{(2)}$.

Theorem 5.
Given a connected properly 2-colored digraph $(\vec{G},\sigma)$ as input, Alg. 1 returns a tree $T$ if and only if $(\vec{G},\sigma)$ is a 2-colored BMG. In particular, $T$ is the unique least resolved tree for $(\vec{G},\sigma)$.

Proof. By Lemma 10, Alg. 1 returns the unique least resolved tree $T$ if $(\vec{G},\sigma)$ is a connected 2-colored BMG. To prove the converse, suppose that Alg. 1 returns a tree $T$ given the connected properly 2-colored digraph $(\vec{G},\sigma)$ as input. We will show that $(\vec{G},\sigma) = \vec{G}(T,\sigma)$, and thus $(\vec{G},\sigma)$ is a BMG.

It is easy to see that $L(T) = V(\vec{G})$ must hold since, in each step of Alg. 1, every vertex is either attached to some inner vertex or passed down to a deeper-level recursion as part of some connected component. Therefore, every vertex of $\vec{G}$ eventually appears in the output. Thus $\sigma(L(T)) = \sigma(V(\vec{G}))$ and $|\sigma(L(T))| = |\sigma(V(\vec{G}))| = 2$. It remains to show $E(\vec{G}) = E(\vec{G}(T,\sigma))$.

Note first that neither $(\vec{G},\sigma)$ nor $\vec{G}(T,\sigma)$ contains arcs between vertices of the same color. Moreover, since Alg. 1 eventually returns a tree, we have $S^{(1)} = S^{(2)} \neq \emptyset$ in every recursion step. Throughout the remainder of the proof, we write $S^{(1)}_i$ and $S^{(2)}_i$ for the sets $S^{(1)}$ and $S^{(2)}$ of the $i$-th recursion step. Likewise, in every step, each connected component $(\vec{G}_v, \sigma_{|.})$ computed in Line 10 must contain at least two vertices (cf. Line 11), and thus $|\sigma(V(\vec{G}_v))| = 2$ because $(\vec{G},\sigma)$ is properly 2-colored.

First, let $S$ be the support set of $\vec{G}(T,\sigma)$ and let $x \in S$ be arbitrary. Note that the support set is computed in the first iteration step of the algorithm as $S = S^{(2)}_1$, hence $S = S^{(2)}_1 \neq \emptyset$. By construction of $T$, $x$ is attached as a leaf to $\rho$, i.e., $\mathrm{lca}_T(x,y) = \rho$ for every other leaf $y$. Consequently, $(x,y)$ is an arc in $\vec{G}(T,\sigma)$ for all $y \in V(\vec{G})$ with $\sigma(y) \neq \sigma(x)$. By construction of $S$ in Alg. 1, we have $x \in S \subseteq U$, i.e., $x$ is an umbrella vertex in $(\vec{G},\sigma)$ and has out-arcs to every vertex $y \in V(\vec{G})$ with $\sigma(y) \neq \sigma(x)$. Hence, all arcs of the form $(x,y)$ with $x \in S$ and $\sigma(x) \neq \sigma(y)$ exist both in $(\vec{G},\sigma)$ and in $\vec{G}(T,\sigma)$. The latter property is in particular satisfied for all vertices in $S$ and hence all arcs between differently colored elements in $S$ exist both in $(\vec{G},\sigma)$ and in $\vec{G}(T,\sigma)$.

Now consider an arbitrary vertex $y \in V(\vec{G}) \setminus S$. Clearly, all in-neighbors in $(\vec{G},\sigma)$ of the elements in $S = S^{(2)}_1$ must be contained in $S$, as a consequence of the condition $S^{(1)}_1 = S^{(2)}_1$ (cf. Line 5) and the construction of $S^{(1)}_1$ and $S^{(2)}_1$. Hence, $y \notin S$ and $x \in S$ imply that $(y,x)$ is not an arc in $(\vec{G},\sigma)$. Moreover, $y \notin S$ also implies that $y$ is part of some connected component $(\vec{G}_v, \sigma_{|.})$ of $(\vec{G} - S, \sigma_{|.})$. Therefore, and because Alg. 1 returns $T$, we must have $y \in V(\vec{G}_v) = L(T(v))$ for some inner vertex $v \in \mathrm{child}_T(\rho)$. As argued above, $(\vec{G}_v, \sigma_{|.})$ and thus also the subtree $T(v)$ contain both colors. Together with Obs. 3 and $x \notin L(T(v))$, this implies that $\vec{G}(T,\sigma)$ does not contain the arc $(y,x)$. By the same arguments, there is no arc $(y,x')$ in $\vec{G}(T,\sigma)$ such that the vertex $x'$ is contained in a connected component $(\vec{G}_{v'}, \sigma_{|.}) \neq (\vec{G}_v, \sigma_{|.})$ of $(\vec{G} - S, \sigma_{|.})$ different from that of $y$. Since $x \in S$ and $y \notin S$ were chosen arbitrarily, we conclude that (i) any arc incident to some vertex in $S$ exists in $(\vec{G},\sigma)$ if and only if it exists in $\vec{G}(T,\sigma)$, and (ii) $\vec{G}(T,\sigma)$ contains no arcs between distinct connected components of $(\vec{G} - S, \sigma_{|.})$. Hence, it remains to consider the arcs within a connected component $(\vec{G}_v, \sigma_{|.})$ of $(\vec{G} - S, \sigma_{|.})$.

Alg. 1 recurses on each such connected component $(\vec{G}_v, \sigma_{|.})$ using a newly created vertex $v \in \mathrm{child}_T(\rho)$ to initialize the tree $T(v)$. By Lemma 3, for any $x, y \in L(T(v)) = V(\vec{G}_v)$, $(x,y)$ is an arc in $\vec{G}(T,\sigma)$ if and only if it is an arc in $\vec{G}(T(v),\sigma)$. Thus, it suffices to consider only the subtree $T(v)$. Now, we can apply the same arguments as in the previous recursion step to conclude that all arcs incident to the support set $S^{(2)}_2$ constructed in the current recursion step are the same in $(\vec{G},\sigma)$ and $\vec{G}(T,\sigma)$, and that neither $(\vec{G},\sigma)$ nor $\vec{G}(T,\sigma)$ contains arcs between distinct connected components of $(\vec{G}_v - S^{(2)}_2, \sigma_{|.})$. Hence, it suffices to consider the connected components of $(\vec{G}_v - S^{(2)}_2, \sigma_{|.})$. Repeated application of this argument results in a chain of connected components that are contained in each other. Since Alg. 1 finally returns a tree, this chain is finite, say with a last element $(\vec{G}_w - S^{(2)}_k, \sigma_{|.})$, and thus $S^{(2)}_k = V(\vec{G}_w)$. In particular, therefore, every vertex in $V(\vec{G})$ is contained in the support set of some recursion step.

In summary, we have shown that $\vec{G}(T,\sigma) = (\vec{G},\sigma)$. Hence, $(\vec{G},\sigma)$ is a connected 2-BMG and, by Lemma 10, $T$ is the unique least resolved tree of $(\vec{G},\sigma)$.

The construction in Lines 2-4 of Alg. 1 naturally produces two cases, $U = S^{(1)} = S^{(2)}$ and $S^{(2)} \subseteq S^{(1)} \subsetneq U$.
The following result shows that the latter case implies that the corresponding interior node in the LRT has only a single non-leaf descendant:

Lemma 11. Let $(\vec{G},\sigma)$ be a 2-BMG and $S_\rho$ the support leaves of the root $\rho$ of its LRT $(T,\sigma)$. If $W := U(\vec{G},\sigma) \setminus S_\rho \neq \emptyset$, then the following statements are true:
1. $S_\rho \neq \emptyset$, $\vec{G}$ is connected, and $\vec{G} - S_\rho$ is connected.
2. All vertices in $U(\vec{G},\sigma) = S_\rho \mathbin{\dot\cup} W$ have the same color.
3. The set of support leaves $S_v$ of the unique inner vertex child $v$ of $\rho$ contains vertices of both colors.
4. $W \subsetneq S_v$.

Proof. First recall that, by Thm. 4 and the definition of the support set $S$ of $(\vec{G},\sigma)$, we have $S_\rho = S \subseteq U(\vec{G},\sigma)$, and thus $U(\vec{G},\sigma) = S_\rho \mathbin{\dot\cup} W$. Moreover, by Lemma 7, the connected components of $(\vec{G} - S_\rho, \sigma_{|.})$ are exactly the BMGs $\vec{G}(T(v), \sigma_{|.})$ with $v \in \mathrm{child}(\rho) \setminus S_\rho$. The vertices $v \in \mathrm{child}(\rho) \setminus S_\rho$ are all inner vertices of $T$ since, by definition, the support leaves $S_\rho$ are exactly the children of $\rho$ that are leaves. Together with the contraposition of Lemma 5, this implies that $T(v)$ contains both colors.

Statement 1: Let $x \in W$, which exists due to the assumption $W := U(\vec{G},\sigma) \setminus S_\rho \neq \emptyset$. Since $x \notin S_\rho$, it must be part of some connected component of $(\vec{G} - S_\rho, \sigma_{|.})$, say $\vec{G}(T(v), \sigma_{|.})$ for some $v \in \mathrm{child}_T(\rho) \setminus S_\rho$. Now assume, for contradiction, that $\vec{G} - S_\rho$ consists of more than one connected component. By Lemmas 7 and 5, there is a vertex $v' \in \mathrm{child}_T(\rho) \setminus S_\rho$ such that $v \neq v'$ and both subtrees $T(v)$ and $T(v')$ contain both colors. Hence, there are distinct $y \in L(T(v))$ and $y' \in L(T(v'))$ with $\sigma(y) = \sigma(y') \neq \sigma(x)$. Together with $x \in L(T(v))$, we therefore have $\mathrm{lca}_T(x,y) \preceq_T v \prec_T \rho = \mathrm{lca}_T(x,y')$, which implies $(x,y') \notin E(\vec{G})$. However, $x \in W \subseteq U(\vec{G},\sigma)$ and $\sigma(y') \neq \sigma(x)$ imply $(x,y') \in E(\vec{G})$; a contradiction. Hence, we conclude that $\vec{G} - S_\rho$ has exactly one connected component, and thus $\rho$ has a single inner vertex child $v$. Since $T$ is phylogenetic, the latter implies that $\rho$ must be incident to at least one leaf, i.e., $S_\rho \neq \emptyset$. Together with Thm. 4, this in turn implies that $\vec{G}$ is connected. In summary, Statement 1 is true.

Statement 2: Let $x \in W$ as in the proof of Statement 1. By arguments analogous to those used for Statement 1, we conclude that $\sigma(x) = \sigma(y)$ for every $y \in S_\rho$, since otherwise we would obtain $(x,y) \notin E(\vec{G})$, and thus a contradiction to $x \in U(\vec{G},\sigma)$. Since $x \in W$ was chosen arbitrarily and $S_\rho$ is non-empty, we immediately obtain that all vertices in $U(\vec{G},\sigma) = S_\rho \mathbin{\dot\cup} W$ have the same color, i.e., Statement 2 is true.

Statement 3: Now consider the single inner vertex child $v$ of $\rho$ and its set of support leaves $S_v$, which must be non-empty by Lemma 6. Note that $W$ must be entirely contained in $L(T(v))$ and recall that all vertices in $S_\rho \mathbin{\dot\cup} W$ are of the same color (cf. Statement 2). First suppose, for contradiction, that $S_v$ only contains vertices of the opposite color to the vertices in $S_\rho \mathbin{\dot\cup} W$. This immediately implies $S_v \cap W = \emptyset$, thus every vertex $x \in W$ must be located in a subtree $T(w)$ of some inner vertex child $w$ of $v$. Again by contraposition of Lemma 5, every such $T(w)$ contains both colors. However, this contradicts $(x,y) \in E(\vec{G})$ for every $y \in S_v$, which must hold as a consequence of $x \in W \subset U(\vec{G},\sigma)$ and $\sigma(y) \neq \sigma(x)$. Next suppose, for contradiction, that $S_v$ only contains vertices of the same color as the vertices in $S_\rho \mathbin{\dot\cup} W$. In this case, we obtain that the edge $\rho v$ is redundant w.r.t. $(\vec{G},\sigma)$. To see this, consider an arc $(x,y) \in E(\vec{G})$ such that $\mathrm{lca}_T(x,y) = v$. Clearly, $x$ must be directly incident to $v$, since otherwise the subtree below $v$ to which $x$ belongs would contain both colors, and thus contradict $(x,y) \in E(\vec{G})$. In other words, every such vertex $x$ is a support leaf of $v$, thus $\sigma(x) = \sigma(S_v) = \sigma(S_\rho)$ and $\sigma(y) \neq \sigma(S_\rho)$. In particular, there exists no arc $(x,y) \in E(\vec{G})$ such that $\mathrm{lca}_T(x,y) = v$ and $\sigma(y) \in \sigma(L(T) \setminus L(T(v))) = \sigma(S_\rho)$ and therefore, by Lemma 1, the inner edge $\rho v$ is redundant. However, this contradicts the fact that $T$ is least resolved. In summary, only the case in which $S_v \neq \emptyset$ contains vertices of both colors is possible, and thus Statement 3 is true.

Statement 4: First, recall from the proof of Statement 3 that $W \subseteq L(T(v))$ for the single inner vertex child $v$ of $\rho$. In order to see that $W \subseteq S_v$, assume for contradiction that this is not the case. By arguments similar to those used for showing Statement 3, this implies that some $x \in W$ lies in a 2-colored subtree $T(w)$ for some $w \in \mathrm{child}_T(v) \setminus S_v$. Together with the above established fact that $S_v$ contains both colors, this contradicts $x \in U(\vec{G},\sigma)$. Finally, $W \neq S_v$ is a consequence of the fact that $S_v$ contains both colors (Statement 3) but $W \subseteq S_\rho \mathbin{\dot\cup} W$ contains only one color (Statement 2).

We now use this result to consider the performance of Alg. 1.

Figure 3: Illustration of Lemma 11. (A) The (local) situation if $W = U \setminus S_\rho \neq \emptyset$ as implied by Lemma 11. In particular, $\rho$ has only a single inner vertex child $v$, all vertices in $U = S_\rho \mathbin{\dot\cup} W$ have the same color, $S_v$ contains vertices of both colors, and $W \subsetneq S_v$. (B) There cannot be a second inner vertex child $v'$, since then none of the vertices except those in $S_\rho$ can be umbrella vertices, e.g. $(a,b)$ is not an arc in the graph explained by the tree in (B). Hence, this situation is not possible for $W \neq \emptyset$. (C) If $S_v$ does not contain vertices of both colors, then the edge $\rho v$ is redundant in the tree, contradicting that $(T,\sigma)$ in Lemma 11 is the LRT.
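As a concrete reference point for the running-time analysis, the recursion of Alg. 1 can be sketched naively in Python. This is an illustrative sketch only, not the implementation used for the benchmarks: the function names `support_sets`, `components`, and `build_lrt`, the input encoding (an iterable of vertices, a set of arc pairs, and a dict `color`), and the nested-list encoding of the tree are all assumptions made here. Every recursion step recomputes degrees and components from scratch, so the sketch makes no attempt to use the dynamic data structures discussed below.

```python
def support_sets(vertices, arcs, color):
    """Compute (U, S1, S2) as in Lines 2-4 of Alg. 1 on the induced subgraph."""
    vertices = set(vertices)
    out_nb = {x: set() for x in vertices}
    in_nb = {x: set() for x in vertices}
    for x, y in arcs:
        if x in vertices and y in vertices:
            out_nb[x].add(y)
            in_nb[y].add(x)
    size = {}  # number of vertices per color
    for x in vertices:
        size[color[x]] = size.get(color[x], 0) + 1
    # Umbrella vertices: out-arcs to every vertex of the other color.
    U = {x for x in vertices
         if len(out_nb[x]) == len(vertices) - size[color[x]]}
    S1 = {x for x in U if in_nb[x] <= U}
    S2 = {x for x in S1 if in_nb[x] <= S1}
    return U, S1, S2


def components(vertices, arcs):
    """Connected components of the underlying undirected induced graph."""
    adj = {x: set() for x in vertices}
    for x, y in arcs:
        if x in adj and y in adj:
            adj[x].add(y)
            adj[y].add(x)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u] - seen:
                seen.add(w)
                comp.add(w)
                stack.append(w)
        comps.append(comp)
    return comps


def build_lrt(vertices, arcs, color):
    """Nested-list tree (leaves are vertex labels), or None if Alg. 1 exits false."""
    U, S1, S2 = support_sets(vertices, arcs, color)
    if not S1 or S2 != S1:
        return None                                    # Lines 5-6
    children = sorted(S2)                              # Lines 8-9
    for comp in components(set(vertices) - S2, arcs):  # Line 10
        if len(comp) == 1:
            return None                                # Lines 11-12
        subtree = build_lrt(comp, arcs, color)         # Line 14
        if subtree is None:
            return None
        children.append(subtree)                       # Line 15
    return children


# Hypothetical toy input: two vertices of color "a", one of color "b".
arcs = {("a1", "b1"), ("a2", "b1"), ("b1", "a2")}
color = {"a1": "a", "a2": "a", "b1": "b"}
print(build_lrt(color, arcs, color))  # ['a1', ['a2', 'b1']]
```

In this toy input the top-level step has $U = \{a1, a2\}$ but $S^{(2)} = \{a1\}$, so $W = \{a2\}$ is non-empty and reappears as a support leaf one level down, next to a vertex of the other color, which is the situation described by Lemma 11.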
Lemma 12. Alg. 1 can be implemented to run in $O(|E| \log |V|)$ time for a connected input graph.

Proof. Since $\vec{G}$ is connected by assumption, we have $|V| \in O(|E|)$. Starting from $(\vec{G},\sigma)$, the list of out-degrees can be constructed in $O(|E|)$. The initial umbrella set $U$ is then obtained by listing the vertices with maximal out-degree in the color class. The initial set $S^{(1)}$ is constructed by checking, for each $u \in U$, the in-neighbors of $u$ for membership in $U$ in $O(|V| + |E|)$ operations. Then $S^{(2)}$ is obtained in the same manner from $S^{(1)}$, requiring $O(|V| + |E|)$ operations. The initial umbrella set $U$ and the sets $S^{(1)}$ and $S^{(2)}$ thus can be constructed in linear time. In each recursive call of Build2ColLRT, at least one leaf is split off, hence the recursion depth is at most $|V| - 1$. Since all in-neighbors of the vertices in $S^{(2)}$ are themselves contained in $S^{(2)} \subseteq U$, their removal does not affect the out-neighborhood of any $x \in V(\vec{G} - U) \subseteq V(\vec{G} - S^{(2)})$, and hence $\mathrm{outdeg}(x)$ does not require updates. The in-neighborhoods $N^-(x)$ can be updated by removing the arcs between $\vec{G} - S^{(2)}$ and $S^{(2)}$ as a consequence of Lemma 7 and Thm. 4. Since every arc appears exactly once in the removal, the total effort for these updates is $O(|E|)$.

We continue by showing that every vertex needs to be considered as an umbrella vertex at most twice, and that the total effort of constructing all sets $S^{(1)}$ and $S^{(2)}$ is $O(|E|)$, given that the umbrella vertices $U$ can be obtained efficiently, which we discuss afterwards. To this end, we distinguish two cases for each recursion step: $S^{(1)} = U$ and $S^{(1)} \subsetneq U$. First, if $S^{(1)} = U$, and thus also $S^{(2)} = S^{(1)} = U$, we consider each in-arc of $x \in U$. Since these vertices and their corresponding arcs are removed when constructing $\vec{G} - S^{(2)}$, they are not considered again in a deeper recursion step. In the second case, we have $S^{(1)} \subsetneq U$, which together with $S^{(2)} = S^{(1)}$ implies $W := U \setminus S^{(2)} \neq \emptyset$, and only the vertices in $U \setminus W$ are removed. However, Lemma 11 guarantees that, for a 2-BMG as input graph, the vertices in $W$ appear as support leaves in the next step and thus appear in the update of $U$, $S^{(1)}$, and $S^{(2)}$ no more than a second time. In order to use the properties in Lemma 11 for the general case (i.e., $(\vec{G},\sigma)$ is not necessarily a BMG), we can, whenever $W \neq \emptyset$, (i) check that $\vec{G} - S^{(2)}$ only has a single connected component $\vec{G}_v$, and (ii) pass down the set $W$ to the recursion step on $\vec{G}_v$, in which the condition $W \subsetneq S^{(2)}$ is checked. If any of these checks fails, we can exit false. This way, we ensure that every vertex appears at most two times as an umbrella vertex in the general case. To construct $S^{(1)}$ from $U$, we have to scan the in-neighborhood $N^-(x)$ of each vertex $x \in U$ and check whether $N^-(x) \subseteq U$. We repeat this step to construct $S^{(2)}$ from $S^{(1)}$. Membership in $U$ and $S^{(1)}$, resp., can be checked in constant time (e.g. by marking the vertices in the current set $U$). Since we have to consider each vertex, and hence each in-neighborhood, at most twice, all sets $S^{(1)}$ and $S^{(2)}$ can be obtained with a total effort of $O(|E|)$.

It remains to show that the input graph can be decomposed efficiently in such a way that the connectivity information is maintained and the candidates for umbrella vertices in each component are updated. The connected components $\vec{G}_v$ can be obtained by using the dynamic data structure described in [6], often called the HDT data structure. It maintains a maximal spanning forest representing the underlying undirected graph with edge set $\tilde{E} = \{xy \mid (x,y) \in E \text{ or } (y,x) \in E\}$, and allows deletion of all $|\tilde{E}| \in O(|E|)$ edges with amortized cost $O(\log |V|)$ per edge deletion.

The explicit traversal of the connected components to compute $U$ can be avoided as follows: Since $\mathrm{outdeg}(x)$ does not require updates, we can maintain a doubly-linked list of vertices $x$ for each of the two colors $i$ and each value of $\mathrm{outdeg}(x)$ where $\sigma(x) = i$. In order to be able to access the highest value of the out-degrees, we maintain these values together with a pointer to the respective doubly-linked list in balanced binary search trees (BSTs), one for each color and each connected component. The BSTs for the two colors are computed first for $(\vec{G},\sigma)$ in $O(|V| \log |V|)$ time and afterwards updated to fit the out-degrees of the currently considered component $\vec{G}_v$. To update these lists and BSTs for $\vec{G}_v$, observe first that $\vec{G}_v$ can be obtained from $\vec{G}$ by stepwise deletion of single arcs, i.e., edges in the HDT data structure representing the underlying undirected versions. We update, resp. construct, the pair of BSTs (one for each color) for each connected component as follows: Since a single arc deletion splits a connected component $\vec{G}'$ into at most two connected components $\vec{G}_1$ and $\vec{G}_2$, we can apply the well-known technique of traversing the smaller component [14]. The size of each connected component can be queried in $O(1)$ time in the HDT data structure. Suppose w.l.o.g. that $|V(\vec{G}_1)| \le |V(\vec{G}_2)|$. We construct a new pair of BSTs for $\vec{G}_1$, and delete the vertices $V(\vec{G}_1)$ and the respective degrees from the two original BSTs for $\vec{G}'$, which then become the BSTs for $\vec{G}_2$. More precisely, we delete each vertex $x \in V(\vec{G}_1)$ from the respective list corresponding to $\mathrm{outdeg}(x)$, and if the length of this list drops to zero, we also remove the corresponding out-degree from the BST. Likewise, we insert the out-degree of $x$ and an empty doubly-linked list into the newly created BST for $\vec{G}_1$, if it is not yet present, and append $x$ to this list. Note that the number of out-degree deletions and insertions does not exceed $|V(\vec{G}_1)|$. Due to the technique of traversing the smaller component, every vertex is deleted and inserted at most $\lfloor \log |V| \rfloor$ times. Therefore, we obtain an overall complexity of $O(|V| \log |V|)$ for the maintenance of the BSTs, where the additional log-factor originates from rebalancing the BSTs whenever necessary.

In each recursion step, the set $U$ can now be obtained by listing (at most) the vertices with the maximal out-degree for each of the two colors. Finding the two out-degrees and corresponding lists in the BSTs requires $O(\log |V|)$ in each step, and thus $O(|V| \log |V|)$ in total. In order to determine whether these candidates $x$ are actually umbrella vertices, we have to check whether $\mathrm{outdeg}(x) = |V(\vec{G}_v)| - |V(\vec{G}_v)[\sigma(x)]|$. The HDT data structure allows constant-time query of the size of a given connected component, since this information gets updated during the maintenance of the spanning forest. By the same means, we can keep track of the number of vertices of a specific color in each connected component. Note that we only need to do this for one color $r$ since $|V(\vec{G}_v)[s]| = |V(\vec{G}_v)| - |V(\vec{G}_v)[r]|$. This does not increase the overall effort for maintaining the data structure since it happens alongside the update of $|V(\vec{G}_v)|$.

In summary, the total effort is dominated by maintaining the connectedness information while deleting $O(|E|)$ arcs, i.e., $O(|E| \log |V|)$ time.

As a direct consequence of Thm. 4, the LRT of a disconnected graph $\vec{G}$ is obtained by connecting the roots of the LRTs $T_v$ of the connected components $\vec{G}_v$ to an additional root vertex, see also [5, Cor. 4]. Lemma 12 thus implies

Theorem 6.
The LRT of a 2-BMG can be computed in $O(|V| + |E| \log |V|)$.

Proof. The connected components $G_i = (V_i, E_i)$ of $G = (V,E)$ can be enumerated in $O(|V| + |E|)$ operations, e.g. using a breadth-first search on the underlying undirected graph. By Lemma 12, $O(|E_i| \log |V_i|) \subseteq O(|E_i| \log |V|)$ operations are required for each $G_i$. Hence, the total effort is $O(|V| + |E| + \log |V| \sum_i |E_i|) = O(|V| + |E| \log |V|)$.

In order to illustrate the improved complexity for the construction of LRTs of 2-BMGs, we implemented both the well-known triple-based approach, i.e., the application of BUILD [2] with the informative triples defined in Eq. (1) as input, and the new approach of Alg. 1. As input, we used 2-BMGs that were randomly generated as follows: First, we simulate random trees $T$ recursively, starting from a single vertex, by attaching to a randomly chosen vertex $v$ either a single leaf if $v$ is an inner vertex of $T$ or a pair of leaves if $v$ was a leaf. The construction stops when the desired number of leaves is reached. Note that the resulting tree is phylogenetic by construction. Each leaf is then colored by selecting at random one of the two colors. Finally, we compute the 2-BMG $\vec{G}(T,\sigma)$ from each of the simulated leaf-colored trees $(T,\sigma)$.

Both methods for the LRT computation were implemented in Python. Moreover, we note that we did not implement the sophisticated dynamic data structures used in the proof of Lemma 12, but a rather naïve implementation of Alg. 1. Nevertheless, Fig. 4 shows a remarkable improvement of the running time when compared to the general $O(|V| |E| \log |V|)$ approach for $\ell$-BMGs detailed in [5]. Empirically, we observe that the running time of Alg. 1 indeed scales nearly linearly with the number of edges.

Figure 4: Running time comparison of the general approach for constructing an LRT using BUILD with informative triples (blue) vs. Alg. 1 (green), plotted as time [s] against the number of leaves. For each number of leaves, 200 2-BMGs were generated as described in the text. In the left panel, the median values are shown with logarithmic axes. The additional dotted line indicates the median values of the size of the simulated BMGs, i.e. the number of arcs, scaled down by a power of ten. We did not compute the LRTs with the first method for instances with more than 1000 leaves because of the excessive computational cost.

Figure 5: The tree on the r.h.s. explains the hourglass graph on the l.h.s.

Binary phylogenetic trees are of particular interest in practical applications. Not every 2-BMG can be explained by a binary tree. The subclass of binary-explainable ($\ell$-)BMGs is characterized among all BMGs by the absence of a single forbidden subgraph called the hourglass [10, 11], illustrated in Fig. 5. In this section we briefly describe a modification of Alg. 1 that allows the efficient recognition of binary-explainable 2-BMGs.
Definition 9. An hourglass in a properly vertex-colored graph $(\vec{G},\sigma)$, denoted by $[xy \searrow\!\!\nearrow x'y']$, is a subgraph $(\vec{G}[Q], \sigma_{|Q})$ induced by a set of four pairwise distinct vertices $Q = \{x, x', y, y'\} \subseteq V(\vec{G})$ such that (i) $\sigma(x) = \sigma(x') \neq \sigma(y) = \sigma(y')$, (ii) $(x,y), (y,x)$ and $(x',y'), (y',x')$ are bidirectional arcs in $\vec{G}$, (iii) $(x,y'), (y,x') \in E(\vec{G})$, and (iv) $(y',x), (x',y) \notin E(\vec{G})$.

A graph $(\vec{G},\sigma)$ is called hourglass-free if it does not contain an hourglass as an induced subgraph. We summarize Lemma 31 and Prop. 8 in [11] as

Proposition 3. For every BMG $(\vec{G},\sigma)$, the following three statements are equivalent:
1. $(\vec{G},\sigma)$ is binary-explainable.
2. $(\vec{G},\sigma)$ is hourglass-free.
3. If $(T,\sigma)$ is a tree explaining $(\vec{G},\sigma)$, then there is no vertex $u \in V(T)$ with three distinct children $v_1$, $v_2$, and $v_3$ and two distinct colors $r$ and $s$ satisfying
   (a) $r \in \sigma(L(T(v_1)))$, $r, s \in \sigma(L(T(v_2)))$, and $s \in \sigma(L(T(v_3)))$, and
   (b) $s \notin \sigma(L(T(v_1)))$, and $r \notin \sigma(L(T(v_3)))$.

The following lemma shows that the third condition in Prop. 3 can be translated to a much simpler statement in terms of the support leaves of its LRT.
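For comparison with the LRT-based criterion, conditions (i)-(iv) of Definition 9 can also be checked directly by brute force on small instances. The sketch below is only a testing aid under assumptions made here: the function name `find_hourglass` and the input encoding (a list of vertices, a set of arc pairs, a dict `color`) are hypothetical, and the enumeration of all ordered 4-tuples takes time quartic in the number of vertices, so it is not the efficient recognition procedure discussed in this section.

```python
from itertools import permutations


def find_hourglass(vertices, arcs, color):
    """Return (x, x', y, y') satisfying Definition 9 (i)-(iv), or None."""
    E = set(arcs)
    for x, xp, y, yp in permutations(vertices, 4):
        # (i) x, x' share one color; y, y' share the other color.
        same_pairs = color[x] == color[xp] and color[y] == color[yp]
        if not same_pairs or color[x] == color[y]:
            continue
        # (ii) x <-> y and x' <-> y' are bidirectional arcs.
        bidirectional = {(x, y), (y, x), (xp, yp), (yp, xp)} <= E
        # (iii) the two crossing arcs are present.
        crossing = {(x, yp), (y, xp)} <= E
        # (iv) the two reverse crossing arcs are absent.
        missing = (yp, x) not in E and (xp, y) not in E
        if bidirectional and crossing and missing:
            return (x, xp, y, yp)
    return None


# Hypothetical example: exactly the six arcs required by (ii) and (iii).
V = ["x", "xp", "y", "yp"]
col = {"x": "r", "xp": "r", "y": "s", "yp": "s"}
E = {("x", "y"), ("y", "x"), ("xp", "yp"), ("yp", "xp"),
     ("x", "yp"), ("y", "xp")}
print(find_hourglass(V, E, col))  # ('x', 'xp', 'y', 'yp')
```

Adding the arc `("yp", "x")` to `E` violates condition (iv) for every admissible assignment of roles, and the function then returns `None`.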
Lemma 13. A 2-BMG $(\vec{G},\sigma)$ contains an induced hourglass if and only if its LRT $(T,\sigma)$ contains an inner vertex $u$ such that $S_u$ contains support vertices of both colors and $V(\vec{G}(T(u)) - S_u) \neq \emptyset$.

Proof. By Thm. 5, Alg. 1 returns the LRT $(T,\sigma)$ for $(\vec{G},\sigma)$ if and only if $(\vec{G},\sigma)$ is a 2-BMG. Hence, we assume in the following that the latter is satisfied. As a consequence of Prop. 3 and the fact that $(T,\sigma)$ explains $(\vec{G},\sigma)$, we know that $(\vec{G},\sigma)$ is binary-explainable if and only if there is no vertex $u \in V(T)$ with three distinct children $v_1$, $v_2$, and $v_3$ and two distinct colors $r$ and $s$ satisfying (a) $r \in \sigma(L(T(v_1)))$, $r, s \in \sigma(L(T(v_2)))$, and $s \in \sigma(L(T(v_3)))$, and (b) $s \notin \sigma(L(T(v_1)))$, and $r \notin \sigma(L(T(v_3)))$.

First, suppose that $(\vec{G},\sigma)$ contains an hourglass, i.e., by Prop. 3 there is a vertex $u \in V(T)$ with distinct children $v_1$, $v_2$, and $v_3$ and two distinct colors $r$ and $s$ satisfying (a) and (b). Since $(\vec{G},\sigma)$ is 2-colored and $(T,\sigma)$ is its LRT, Lemma 5 together with $s \notin \sigma(L(T(v_1)))$ and $r \notin \sigma(L(T(v_3)))$ implies that $v_1$ of color $r$ and $v_3$ of color $s$, respectively, are both leaves. In particular, therefore, we know that $v_1, v_3 \in S_u$ are support leaves. By Lemma 7 and since $\vec{G}(T(u), \sigma_{|.})$ is also a BMG, the connected components of $(\vec{G}(T(u)) - S_u, \sigma_{|.}) = (\vec{G}[L(T(u))] - S_u, \sigma_{|.})$ (cf. Lemma 3) are exactly the BMGs $\vec{G}(T(v), \sigma_{|.})$ with $v \in \mathrm{child}(u) \setminus S_u$. Together with the fact that $v_2$ is an inner vertex of $T$, as a consequence of $L(T(v_2))$ containing both colors $r$ and $s$, this implies that $(\vec{G}(T(u)) - S_u, \sigma_{|.})$ is not the empty graph.

Conversely, suppose there is a vertex $u \in V(T)$ such that $S_u$ contains support vertices $v_1$ and $v_3$ with distinct colors $\sigma(v_1) \neq \sigma(v_3)$ and $V(\vec{G}(T(u)) - S_u) \neq \emptyset$, i.e., $u$ has a child $v_2 \in \mathrm{child}(u) \setminus S_u$ that is not a support leaf and hence is an inner vertex of $T$. Lemma 5 implies that $L(T(v_2))$ contains both colors since $v_2$ is an inner vertex. Hence, the three children $v_1$, $v_2$, and $v_3$ of $u$ satisfy conditions (a) and (b) of Prop. 3(3), and thus $(\vec{G},\sigma)$ contains an induced hourglass.

Corollary 4.
It can be checked in $O(|V| + |E| \log |V|)$ whether or not a properly 2-colored graph $(\vec{G},\sigma)$ is a binary-explainable BMG.

Proof. Recall that there is a one-to-one correspondence between the recursion steps in Alg. 1 and the inner vertices $u$ of $T$. As argued in the proof of Lemma 12, every vertex appears at most twice in an umbrella set $U$. Therefore, it can be checked in $O(|V|)$ total time whether $S = S^{(2)}$ contains vertices of both colors. Since the vertex set of $\vec{G}_u - S_u$ is maintained in the dynamic graph HDT data structure, it can be checked in constant time for each $u$ whether $\vec{G}_u - S_u$ is non-empty. The additional effort to check the condition of Lemma 13 is therefore only $O(|V|)$. Hence, we still require a total effort of $O(|V| + |E| \log |V|)$ (cf. Thm. 6).

Cor. 4 improves the complexity for the decision whether a 2-BMG is binary-explainable as compared to the $O(|V| \log |V|)$-time algorithm for (general) BMGs presented in [10].

We have shown here that 2-BMGs have a recursive structure that is reflected in certain induced subgraphs that correspond to subtrees of the LRT. The leaves connected directly to the root of a given subtree play a special role as support vertices in the corresponding subgraph of the 2-BMG. Since the support vertices of the root can be identified efficiently in a given input graph, there is a recursive decomposition of $(\vec{G},\sigma)$ that directly yields the LRT. With the help of a dynamic data structure to maintain connectedness information [6], this provides an $O(|V| + |E| \log |V|)$ algorithm to recognize both 2-BMGs and binary-explainable 2-BMGs and to construct the corresponding LRT. This provides a considerable speed-up compared to the previously known $O(|V||E| \log |V|)$ and $(|V|)$ algorithms. Empirically, we observe a substantial speed-up even if simpler data structures are used to implement Alg. 1.

Both the theoretical insights and Alg. 1 itself have potential applications to the analysis of gene families in computational biology. Real-life data necessarily contain noise, and thus will likely deviate from perfect BMGs, naturally leading to graph editing problems for BMGs. Like many combinatorial problems in phylogenetics, these are NP-complete [12] and hence require approximation algorithms and heuristics. The support leaves introduced here provide an avenue to a new class of heuristics, conceptually distinct from approaches that attempt to extract consistent subsets of triples from $R(\vec{G},\sigma)$.

Acknowledgments
This work was supported in part by the Austrian Federal Ministries BMK and BMDW and the Province of Upper Austria in the frame of the COMET Programme managed by FFG, and the Deutsche Forschungsgemeinschaft.

References

[1] G. Abrams and J. K. Sklar. The graph menagerie: Abstract algebra and the mad veterinarian. Math. Mag., 83:168–179, 2010. doi:10.4169/002557010X494814.
[2] A. Aho, Y. Sagiv, T. Szymanski, and J. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput., 10:405–421, 1981. doi:10.1137/0210030.
[3] H. Cohn, R. Pemantle, and J. G. Propp. Generating a random sink-free orientation in quadratic time. Electr. J. Comb., 9:R10, 2002. doi:10.37236/1627.
[4] S. Das, P. Ghosh, S. Ghosh, and S. Sen. Oriented bipartite graphs and the Goldbach graph. Technical Report math.CO/1611.10259v6, arXiv, 2020.
[5] M. Geiß, E. Chávez, M. González Laffitte, A. López Sánchez, B. M. R. Stadler, D. I. Valdivia, M. Hellmuth, M. Hernández Rosales, and P. F. Stadler. Best match graphs. J. Math. Biol., 78:2015–2057, 2019. doi:10.1007/s00285-019-01332-9.
[6] J. Holm, K. de Lichtenberg, and M. Thorup. Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. J. ACM, 48:723–760, 2001. doi:10.1145/502090.502095.
[7] A. Korchmaros. Circles and paths in 2-colored best match graphs. Technical Report math.CO/2006.04100v1, arXiv, 2020.
[8] A. Korchmaros. The structure of 2-colored best match graphs. Technical Report math.CO/2009.00447v2, arXiv, 2020.
[9] D. Schaller, M. Geiß, E. Chávez, M. González Laffitte, A. López Sánchez, B. M. R. Stadler, D. I. Valdivia, M. Hellmuth, M. Hernández Rosales, and P. F. Stadler. Best match graphs (corrigendum). arxiv.org/1803.10989v4, 2020.
[10] D. Schaller, M. Geiß, M. Hellmuth, and P. F. Stadler. Best match graphs with binary trees. 2020. Submitted; arXiv: 2011.00511.
[11] D. Schaller, M. Geiß, P. F. Stadler, and M. Hellmuth. Complete characterization of incorrect orthology assignments in best match graphs. J. Math. Biol., 2021. Accepted; arXiv: 2006.02249.
[12] D. Schaller, P. F. Stadler, and M. Hellmuth. Complexity of modification problems for best match graphs. Theor. Comp. Sci., 2021. In revision; arXiv: 2006.16909.
[13] C. Semple and M. Steel. Phylogenetics. Oxford University Press, Oxford UK, 2003.
[14] Y. Shiloach and S. Even. An on-line edge-deletion problem. J. ACM, 28:1–4, 1981. doi:10.1145/322234.322235.