Complete Characterization of Incorrect Orthology Assignments in Best Match Graphs
David Schaller, Manuela Geiß, Peter F. Stadler, Marc Hellmuth
Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, D-04107 Leipzig, Germany
Software Competence Center Hagenberg GmbH, Softwarepark 21, A-4232 Hagenberg, Austria
German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Competence Center for Scalable Data Services and Solutions Dresden-Leipzig, Leipzig Research Center for Civilization Diseases, and Centre for Biotechnology and Biomedicine at Leipzig University at Universität Leipzig
Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Wien, Austria
Facultad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Colombia
Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe NM 87501, USA
School of Computing, University of Leeds, EC Stoner Building, Leeds LS2 9JT, UK
[email protected]
* corresponding author

Abstract
Genome-scale orthology assignments are usually based on reciprocal best matches. In the absence of horizontal gene transfer (HGT), every pair of orthologs forms a reciprocal best match. Incorrect orthology assignments therefore are always false positives in the Reciprocal Best Match Graph. We consider duplication/loss scenarios and characterize unambiguous false-positive (u-fp) orthology assignments, that is, edges in the Best Match Graphs (BMGs) that cannot correspond to orthologs for any gene tree that explains the BMG. We characterize u-fp edges in terms of subgraphs of the BMG and show that, given a BMG, there is a unique "augmented tree" that explains the BMG and identifies all u-fp edges in terms of overlapping sets of species in certain subtrees. The augmented tree can be constructed as a refinement of the unique least resolved tree of the BMG in polynomial time. Removal of the u-fp edges from the reciprocal best matches results in a unique orthology assignment.
Keywords: orthology detection; best matches; unambiguous orthologs; colored graphs; cograph; tree reconciliation; polynomial-time algorithm
Orthology is one of the key concepts in evolutionary biology: Two genes are orthologs if their last common ancestor was a speciation event [8]. Distinguishing orthologs from paralogs (originating from gene duplications) or xenologs (i.e., genes that have undergone horizontal gene transfer) is of considerable practical importance for functional genome annotation and thus for a wide array of methods in bioinformatics and computational biology that rely on gene annotation data. Orthologous genes in different species are expected to have essentially the same biological and molecular functions. Paralogs and xenologs, in contrast, tend to have similar, but clearly distinct functions [9, 45, 51]. Most of the commonly used tools for large-scale orthology identification compute reciprocal best hits as a first step followed by some filtering and refinement steps to improve the results, see [10, 35, 44] for reviews and [1] for benchmarking results.

Orthology identification has also received increasing attention from a mathematical perspective starting from the concept of an evolutionary scenario comprising a gene tree $T$ and a species tree $S$ together with a reconciliation map $\mu$ from $T$ to $S$. The map $\mu$ identifies the locations in the species tree at which evolutionary events, represented by the vertices of the gene tree, took place. In this contribution, we consider exclusively duplication/loss scenarios, i.e., we explicitly exclude horizontal gene transfer.
Characterizations of reconciliation maps are given e.g. in [7, 15, 40, 50]. While every gene tree can be reconciled with any species tree [16, 37], this is no longer true if event labels are prescribed in the gene tree $T$ [19, 24, 30].

The orthology relation itself has been characterized as a cograph (i.e., a graph that does not contain an induced path $P_4$ on four vertices) by Hellmuth et al. [22] based on earlier work by Böcker and Dress [2]. This line of research has led to the idea of editing reciprocal best hit data to conform to the required cograph structure [23]. Maybe surprisingly, the combinatorial structure of best matches has become a focus only very recently. We consider best matches as an evolutionary concept: A gene $y$ in species $s$ is a best match of a gene $x$ from species $r \neq s$ if $s$ contains no gene $y'$ that is more closely related to $x$. In practice, best matches are approximated by sequence similarity. We refer to [46] and the references therein for a detailed account of how best matches can be extracted from sequence similarity data. Best Match Graphs (BMGs) have several appealing properties: They have several alternative characterizations providing polynomial-time recognition algorithms and they are "explained" by a unique least resolved tree [12]. These properties will be introduced formally in the next section and play an important role in our discussion. The Reciprocal Best Match Graphs (RBMGs) are the symmetric parts of BMGs and conceptually correspond to the reciprocal best hits used in orthology detection. In contrast to BMGs, RBMGs are much more difficult to handle and are not associated with unique trees [14].

RBMGs in general are not cographs and thus contain incorrect orthology assignments associated with $P_4$s. Importantly, such incorrect assignments are always false positives and thus correspond to edges in RBMGs that must be deleted [13]. A $P_4$ in an RBMG arises in particular as a consequence of complementary gene loss, i.e., the complete loss of different paralogous groups in disjoint lineages. Under certain circumstances, such false-positive orthology assignments can be identified, in particular when there are also species in which both paralogs have survived [6]. A subset of false-positive orthology assignments, the "good quartets", can be identified unambiguously by considering BMGs instead of RBMGs [14]. Their removal already leads to a substantial improvement in simulated data [13].

Here, we consider the false-positive orthology assignments in more detail, using BMGs and their explaining trees as the formal framework. Section 3.2 introduces the notion of unambiguous false-positive edges, that is, reciprocal best matches that cannot be orthologs w.r.t. any gene tree explaining the BMG. For BMGs that can be explained by fully resolved, i.e., binary gene trees, Thm. 3.9 shows that unambiguous false-positive edges are already determined by this particular gene tree. Furthermore, we will see in Prop. 3.15 that these false positives are the middle edges of a good quartet or one of the outer edges of an ugly quartet [13, 14]. This leaves unambiguous false-positive edges in BMGs that have no explanation by a binary tree, i.e., that require hard polytomies in the evolutionary scenario. In Section 3.5 we show that these false positives are associated with another class of induced subgraphs of the BMG, which we termed hourglasses. Thm. 4.5 then provides an additional set of false-positive edges that are embedded in chains of hourglasses in a certain manner.

The gene tree $T$ "displays" two trees that play a key role for our purposes: the discriminating cotree of the orthology cograph and the least resolved tree of the BMG. While the first can be endowed with an unambiguous event labeling, this is not true for the latter [13].
The least resolved tree may contain polytomic vertices arising by the merging of speciation and duplication events. This leads us to the construction of a unique augmented tree in Section 4.3, which can be endowed with an unambiguous event labeling and thus induces a unique cograph. Thm. 4.19 states that a reciprocal best match is an unambiguous false positive if and only if it is not contained in the cograph induced by the augmented tree. Algorithm 1 furthermore computes the augmented tree in polynomial time from the least resolved tree of the BMG, resulting in a polynomial-time procedure to identify all unambiguous false-positive edges in a BMG. In simulations, we find that at least three quarters of all false positives fall into this class. The remaining cases are not recognizable from best matches alone and correspond to complementary losses without surviving witnesses.

We consider directed graphs $\vec{G} = (V, E)$, for brevity just called graphs throughout, with arc set $E \subseteq V \times V \setminus \{(v, v) \mid v \in V\}$. We say that $xy$ is an edge in $\vec{G}$ if and only if both $(x, y) \in E(\vec{G})$ and $(y, x) \in E(\vec{G})$. If all arcs of $\vec{G}$ form edges, we call $\vec{G}$ undirected. A graph $H = (W, F)$ is a subgraph of $G = (V, E)$, in symbols $H \subseteq G$, if $W \subseteq V$ and $F \subseteq E$. The underlying symmetric part of a directed graph $\vec{G} = (V, E)$ is the subgraph $G = (V, F)$ that contains all edges of $\vec{G}$. A subgraph $H = (W, F)$ (of $\vec{G}$) is called induced, denoted by $\vec{G}[W]$, if for all $u, v \in W$ it holds that $(u, v) \in E$ implies $(u, v) \in F$. In addition, we consider vertex-colored graphs $(\vec{G}, \sigma)$ with vertex-coloring $\sigma : V \to M$ into some set $M$ of colors. A vertex-coloring is called proper if $\sigma(x) \neq \sigma(y)$ for every arc $(x, y)$ in $\vec{G}$. We write $\sigma(W) = \{\sigma(w) \mid w \in W\}$ for subsets $W \subseteq V$ and $\sigma_{|W}$ to denote the restriction of the map $\sigma$ to $W \subseteq V$. In particular, $(\vec{G}[W], \sigma_{|W})$ is an induced vertex-colored subgraph of $(\vec{G}, \sigma)$.

A path (of length $\ell$) in a directed graph $\vec{G}$ or an undirected graph $G$ is a subgraph induced by a nonempty sequence of pairwise distinct vertices $P(x_0, x_\ell) := (x_0, x_1, \dots, x_\ell)$ such that $(x_i, x_{i+1}) \in E(\vec{G})$ or $x_i x_{i+1} \in E(G)$, resp., for $0 \le i \le \ell - 1$. We use the notation $P(x_0, x_\ell)$ both for the sequence of vertices and the subgraph they induce.

All trees $T = (V, E)$ considered here are undirected, planted and phylogenetic, that is, they satisfy (i) the root $0_T$ has degree 1 and (ii) all inner vertices have degree $\deg_T(u) \ge 3$. We write $L(T)$ for the leaves (not including $0_T$) and $V^0 = V(T) \setminus (L(T) \cup \{0_T\})$ for the inner vertices (also not including $0_T$). To avoid trivial cases, we will always assume $|L(T)| \ge 2$. An edge $uv$ in $T$ is an inner edge if $u, v \in V^0(T)$ are inner vertices. The conventional root $\rho_T$ of $T$ is the unique neighbor of $0_T$. The main reason for using planted phylogenetic trees instead of modeling phylogenetic trees simply as rooted trees, which is the much more common practice in the field, is that we will often need to refer to the time before the first branching event, i.e., the edge $0_T\rho_T$.

We define the ancestor order on a given tree $T$ as follows: if $y$ is a vertex of the unique path connecting $x$ with the root $0_T$, we write $x \preceq_T y$, in which case $y$ is called an ancestor of $x$ and $x$ is called a descendant of $y$. We use $x \prec_T y$ for $x \preceq_T y$ and $x \neq y$. If $x \preceq_T y$ or $y \preceq_T x$ the vertices $x$ and $y$ are comparable and, otherwise, incomparable. If $xy$ is an edge in $T$ such that $y \prec_T x$, then $x$ is the parent of $y$ and $y$ the child of $x$. We denote by $\operatorname{child}_T(x)$ the set of all children of $x$. It will be convenient for the discussion below to extend the ancestor relation $\preceq_T$ to the union of the edge and vertex sets of $T$. More precisely, for a vertex $x \in V(T)$ and an edge $e = uv \in E(T)$ with $v \prec_T u$ we write $x \prec_T e$ if and only if $x \preceq_T v$, and $e \prec_T x$ if and only if $u \preceq_T x$. For edges $e = uv$ with $v \prec_T u$ and $f = ab$ with $b \prec_T a$ in $T$ we put $e \preceq_T f$ if and only if $v \preceq_T b$.

For a non-empty subset $A \subseteq V \cup E$ we define $\operatorname{lca}_T(A)$, the last common ancestor of $A$, to be the unique $\preceq_T$-minimal vertex of $T$ that is an ancestor of every vertex or edge in $A$. For simplicity we drop the brackets and write $\operatorname{lca}_T(x_1, \dots, x_k) := \operatorname{lca}_T(\{x_1, \dots, x_k\})$ whenever we specify a set of vertices or edges explicitly.

A vertex $v \in V(T)$ is binary if $\deg_T(v) = 3$, i.e., if $v$ has exactly two children. A tree is binary if all vertices $v \in V^0$ are binary. For $v \in V(T)$ we denote by $T(v)$ the subtree of $T$ rooted in $v$. The set of clusters of a tree $T$ is $\mathcal{C}(T) = \{L(T(v)) \mid v \in V(T)\}$. It is well-known that $\mathcal{C}(T)$ uniquely determines $T$ [42]. We say that a tree $T$ is a refinement of some tree $T'$ if $\mathcal{C}(T') \subseteq \mathcal{C}(T)$. A tree $T'$ is displayed by a tree $T$, in symbols $T' \le T$, if $T'$ can be obtained from a subtree of $T$ by contraction of edges [43], where the contraction of an edge $e = uv$ in a tree $T = (V, E)$ refers to the removal of $e$ and identification of $u$ and $v$. It is easy to verify that every refinement $T$ of $T'$ also displays $T'$. However, the converse is not always true, since $L(T') \subsetneq L(T)$ and thus $\mathcal{C}(T') \not\subseteq \mathcal{C}(T)$ may be possible.

We consider a pair $T = (V, E)$ and $S = (W, F)$ of planted phylogenetic trees together with a map $\sigma : L(T) \to L(S)$. We interpret $T$ as a gene tree and $S$ as a species tree; the map $\sigma$ describes, for each gene $x \in L(T)$, in the genome of which species $\sigma(x) \in L(S)$ it resides. W.l.o.g. we assume that the "gene-species-association" $\sigma$ is a surjective map to avoid trivial cases. Since $\sigma$ can be viewed as a coloring of the leaves of $T$, we call $(T, \sigma)$ a leaf-colored tree. For $s \in L(S)$ we write $L[s] := \{x \in L(T) \mid \sigma(x) = s\}$.

Definition 2.1.
Let $(T, \sigma)$ be a leaf-colored tree. A leaf $y \in L(T)$ is a best match of the leaf $x \in L(T)$ if $\sigma(x) \neq \sigma(y)$ and $\operatorname{lca}_T(x, y) \preceq_T \operatorname{lca}_T(x, y')$ holds for all leaves $y'$ from species $\sigma(y') = \sigma(y)$. The leaves $x, y \in L(T)$ are reciprocal best matches if $y$ is a best match for $x$ and $x$ is a best match for $y$.

The graph $\vec{G}(T, \sigma) = (V, E)$ with vertex set $V = L(T)$, vertex coloring $\sigma$, and with arcs $(x, y) \in E$ if and only if $y$ is a best match of $x$ w.r.t. $(T, \sigma)$ is known as the (colored) Best Match Graph of $(T, \sigma)$ [12]. The symmetric part $G(T, \sigma)$ of $\vec{G}(T, \sigma)$ obtained by retaining the edges of $\vec{G}(T, \sigma)$ is the (colored) Reciprocal Best Match Graph [14].
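To make the definition of best matches concrete, the following is a minimal Python sketch that computes the arcs of the Best Match Graph and the edges of its symmetric part directly from the definition. The encoding of the tree as a child-to-parent dictionary, the function names `best_match_graph` and `reciprocal_best_matches`, and the toy tree are illustrative assumptions and not part of the original text; the quadratic-per-pair comparison is chosen for readability, not efficiency.

```python
def best_match_graph(parent, color):
    """Arcs (x, y) such that y is a best match of x.

    parent: dict mapping every non-root vertex to its parent (assumed encoding).
    color:  dict mapping every leaf to its species (the map sigma).
    """
    leaves = list(color)

    def root_path(v):
        # Vertices from v up to the planted root, v first.
        path = [v]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    paths = {x: root_path(x) for x in leaves}
    index = {x: {v: i for i, v in enumerate(paths[x])} for x in leaves}

    def lca_rank(x, y):
        # Position of lca(x, y) on the path from x to the root.
        # All lca(x, .) lie on this path, so a smaller rank means
        # a lower vertex in the ancestor order.
        return min(index[x][v] for v in paths[y] if v in index[x])

    arcs = set()
    for x in leaves:
        for y in leaves:
            if color[x] == color[y]:
                continue
            rank = lca_rank(x, y)
            same_species = (y2 for y2 in leaves if color[y2] == color[y])
            if all(rank <= lca_rank(x, y2) for y2 in same_species):
                arcs.add((x, y))
    return arcs


def reciprocal_best_matches(arcs):
    """Edges of the symmetric part: pairs with arcs in both directions."""
    return {frozenset((x, y)) for (x, y) in arcs if (y, x) in arcs}


# Hypothetical toy tree: planted root "0", conventional root "rho",
# inner vertex "u" with children a1 and b1, and a second leaf b2 below rho.
parent = {"rho": "0", "u": "rho", "b2": "rho", "a1": "u", "b1": "u"}
color = {"a1": "A", "b1": "B", "b2": "B"}

arcs = best_match_graph(parent, color)
edges = reciprocal_best_matches(arcs)
```

In this toy example, `b1` is the only best match of `a1` in species `B`, while `a1` is a best match of both `b1` and `b2` because it is the only gene of species `A`. Hence the arc from `b2` to `a1` is not reciprocated, and the only reciprocal best match is the pair `a1`, `b1`.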
Definition 2.2. An arbitrary vertex-colored graph $(\vec{G}, \sigma)$ is a Best Match Graph (BMG) if there exists a leaf-colored tree $(T, \sigma)$ such that $(\vec{G}, \sigma) = \vec{G}(T, \sigma)$. In this case, we say that $(T, \sigma)$ explains $(\vec{G}, \sigma)$. An arbitrary undirected vertex-colored graph $(G, \sigma)$ is a Reciprocal Best Match Graph (RBMG) if it is the symmetric part of a BMG $(\vec{G}, \sigma)$.

For the symmetric part of the BMG $(\vec{G}, \sigma)$, i.e., the RBMG $(G, \sigma)$, we have $xy \in E(G)$ if and only if $x$ and $y$ are reciprocal best matches in $(T, \sigma)$. In this sense, $(T, \sigma)$ also explains $(G, \sigma)$. We note, furthermore, that RBMGs are not associated with a unique least resolved tree [14].

In the following we collect some useful properties of BMGs and RBMGs for later reference.

Lemma 2.3. [14, Lemma 10] Let $(T, \sigma)$ be a leaf-colored tree on $L$ and let $v \in V(T)$. Then, for any two distinct colors $r, s \in \sigma(L(T(v)))$, there is an edge $xy$ in $\vec{G}(T, \sigma)$ with $x \in L[r] \cap L(T(v))$ and $y \in L[s] \cap L(T(v))$.

Lemma 2.4.
Let $(\vec{G}, \sigma)$ be a BMG explained by a tree $(T, \sigma)$. Moreover, let $x, y \in L(T)$ with $\sigma(x) \neq \sigma(y)$ and $v_x, v_y \in \operatorname{child}_T(\operatorname{lca}_T(x, y))$ with $x \preceq_T v_x$ and $y \preceq_T v_y$. Then, $\sigma(x) \notin \sigma(L(T(v_y)))$ and $\sigma(y) \notin \sigma(L(T(v_x)))$ if and only if $xy$ is an edge in $\vec{G}$.

Proof. By the definition of best matches, it holds that $xy$ is an edge in $\vec{G}$ if and only if $\operatorname{lca}_T(x, y) \preceq_T \operatorname{lca}_T(x, y')$ for all $y' \in L(T)$ of color $\sigma(y)$ and $\operatorname{lca}_T(x, y) \preceq_T \operatorname{lca}_T(x', y)$ for all $x' \in L(T)$ of color $\sigma(x)$. Clearly, $\operatorname{lca}_T(x, y) \preceq_T \operatorname{lca}_T(x, y')$ for all such $y'$ if and only if $\sigma(y) \notin \sigma(L(T(v_x)))$, and $\operatorname{lca}_T(x, y) \preceq_T \operatorname{lca}_T(x', y)$ for all such $x'$ if and only if $\sigma(x) \notin \sigma(L(T(v_y)))$.

Definition 2.5.
Suppose that $(T, \sigma)$ explains $(\vec{G}, \sigma)$. Then we say that $(T, \sigma)$ is least resolved (w.r.t. $(\vec{G}, \sigma)$) if no tree $(T', \sigma)$ displayed by $(T, \sigma)$ explains $(\vec{G}, \sigma)$.

Recall that all trees in this contribution are planted, and thus least resolved trees (LRTs) are also considered as planted. Strictly speaking, this differs from the construction in [12–14]; the additional (non-contractible) edge $0_T\rho_T$, however, is a trivial detail that does not affect the properties of LRTs.

Theorem 2.6. [12, Thm. 8 and Cor. 4] Every BMG $(\vec{G}, \sigma)$ is explained by a unique least resolved tree $(T^*, \sigma)$. In particular, every other tree $(T, \sigma)$ explaining $(\vec{G}, \sigma)$ is a refinement of $(T^*, \sigma)$. The least resolved tree $(T^*, \sigma)$ of a BMG $(\vec{G}, \sigma)$ can be constructed in polynomial time.

Definition 2.7.
Let $(\vec{G}, \sigma)$ be a Best Match Graph. We say that a triple $ab|b'$ is informative for $\vec{G}$ if $a$, $b$ and $b'$ are three different vertices with $\sigma(a) \neq \sigma(b) = \sigma(b')$ in $\vec{G}$ such that $(a, b) \in E(\vec{G})$ and $(a, b') \notin E(\vec{G})$.

Lemma 2.8. Let $(\vec{G}, \sigma)$ be a BMG and $ab|b'$ an informative triple for $\vec{G}$. Then, every tree $T$ that explains $(\vec{G}, \sigma)$ displays the triple $ab|b'$, i.e., $\operatorname{lca}_T(a, b) \prec_T \operatorname{lca}_T(a, b') = \operatorname{lca}_T(b, b')$.

Proof. The definition of informative triples implies that $(a, b) \in E(\vec{G})$ and $(a, b') \notin E(\vec{G})$. Using $\sigma(b) = \sigma(b')$ and the definition of best matches we immediately conclude $\operatorname{lca}_T(a, b) \prec_T \operatorname{lca}_T(a, b')$.

Lemma 2.9.
Let $ab|b'$ and $cb'|b$ be informative triples for a BMG $(\vec{G}, \sigma)$. Then every tree $(T, \sigma)$ that explains $(\vec{G}, \sigma)$ contains two distinct children $v_1, v_2 \in \operatorname{child}_T(\operatorname{lca}_T(a, c))$ such that $a, b \prec_T v_1$ and $b', c \prec_T v_2$.

Proof. Let $(T, \sigma)$ be an arbitrary tree that explains $(\vec{G}, \sigma)$. By Lemma 2.8, $T$ displays the informative triples $ab|b'$ and $cb'|b$. Thus we have $\operatorname{lca}_T(a, b) \prec_T \operatorname{lca}_T(a, b') = \operatorname{lca}_T(b, b')$ and $\operatorname{lca}_T(c, b') \prec_T \operatorname{lca}_T(c, b) = \operatorname{lca}_T(b, b')$. In particular, $\operatorname{lca}_T(a, b') = \operatorname{lca}_T(b, b') = \operatorname{lca}_T(c, b) =: u$. Therefore, $a \preceq_T v_1$ and $b' \preceq_T v_2$ for distinct $v_1, v_2 \in \operatorname{child}_T(u)$. Since $\operatorname{lca}_T(a, b) \prec_T u$, we have $a, b \prec_T v_1$ and thus $v_1$ is an inner node. Likewise, $\operatorname{lca}_T(b', c) \prec_T u$ implies $b', c \prec_T v_2$.

Given a tree $T$ and an edge $e$, denote by $T_e$ the tree obtained from $T$ by contracting the edge $e$. An edge $e \neq 0_T\rho_T$ in $(T, \sigma)$ is redundant (w.r.t. $(\vec{G}, \sigma)$) if $(T, \sigma)$ explains $(\vec{G}, \sigma)$ and $\vec{G}(T_e, \sigma) = \vec{G}(T, \sigma)$. Redundant edges have already been characterized in [12, Lemma 15, Thm. 8] in terms of equivalence classes using a somewhat complicated notation. Here we give a simpler characterization.

Lemma 2.10.
Let $(\vec{G}, \sigma)$ be a BMG explained by a tree $(T, \sigma)$. The edge $e = uv$ with $v \prec_T u$ in $(T, \sigma)$ is redundant w.r.t. $(\vec{G}, \sigma)$ if and only if (i) $e$ is an inner edge of $T$ and (ii) there is no arc $(a, b) \in E(\vec{G})$ such that $\operatorname{lca}_T(a, b) = v$ and $\sigma(b) \in \sigma(L(T(u)) \setminus L(T(v)))$.

Proof. Let $w_e$ be the vertex in $T_e$ resulting from the contraction of $e = uv$ with $v \prec_T u$ in $T$. By assumption we have $(\vec{G}, \sigma) = \vec{G}(T, \sigma)$.

First, assume that $e$ is redundant and thus $\vec{G}(T_e, \sigma) = \vec{G}(T, \sigma)$. Then $e$ must be an inner edge, since otherwise $L(T) \neq L(T_e)$ and, therefore, $(T_e, \sigma)$ does not explain $(\vec{G}, \sigma)$. Now assume, for contradiction, that there is an arc $(a, b) \in E(\vec{G})$ such that $\operatorname{lca}_T(a, b) = v$ and $\sigma(b) \in \sigma(L(T(u)) \setminus L(T(v)))$. Then there is a leaf $b' \in L(T(u)) \setminus L(T(v))$ with $\sigma(b') = \sigma(b)$ and $\operatorname{lca}_T(a, b) = v \prec_T u = \operatorname{lca}_T(a, b')$. Thus, $(a, b') \notin E(\vec{G})$. After contraction of $e$, we have $\operatorname{lca}_{T_e}(a, b) = \operatorname{lca}_{T_e}(a, b') = w_e$. Hence, by definition of best matches, $(a, b)$ is an arc in $\vec{G}(T_e, \sigma)$ if and only if $(a, b')$ is an arc in $\vec{G}(T_e, \sigma)$; a contradiction to the assumption that $(T_e, \sigma)$ explains $(\vec{G}, \sigma)$.

Conversely, assume that $e = uv$ with $v \prec_T u$ is an inner edge in $T$ and that there is no arc $(a, b) \in E(\vec{G})$ such that $\operatorname{lca}_T(a, b) = v$ and $\sigma(b) \in \sigma(L(T(u)) \setminus L(T(v)))$. In order to show that the edge $e$ is redundant, we need to verify that $\vec{G}(T, \sigma) = \vec{G}(T_e, \sigma)$. To this end, consider an arbitrary leaf $c \in L(T)$. Then we have either Case (1) $c \in L(T) \setminus L(T(v))$, or Case (2) $c \in L(T(v))$.

In Case (1) it is easy to verify that $\operatorname{lca}_T(c, d) = \operatorname{lca}_{T_e}(c, d)$ for every $d \in L(T)$. In particular, therefore, $(c, d) \in E(\vec{G}(T, \sigma))$ if and only if $(c, d) \in E(\vec{G}(T_e, \sigma))$.

In Case (2), i.e., $c \in L(T(v))$, consider another, arbitrary, leaf $d \in L(T)$. Note that if $\sigma(c) = \sigma(d)$, then $c$ and $d$ never form a best match. Thus, we assume $\sigma(c) \neq \sigma(d)$. Now, we consider three mutually exclusive Subcases (a) $\operatorname{lca}_T(c, d) \preceq_T v$, (b) $\operatorname{lca}_T(c, d) = u$, and (c) $\operatorname{lca}_T(c, d) \succ_T u$.

Case (a). Since no edge below $v$ is contracted, we have for every $d'$ with $\sigma(d') = \sigma(d)$ that $\operatorname{lca}_T(c, d') \prec_T \operatorname{lca}_T(c, d) \preceq_T v$ if and only if $\operatorname{lca}_{T_e}(c, d') \prec_{T_e} \operatorname{lca}_{T_e}(c, d) \preceq_{T_e} w_e$. In particular, therefore, $(c, d) \in E(\vec{G}(T, \sigma))$ if and only if $(c, d) \in E(\vec{G}(T_e, \sigma))$.

Case (b). $\operatorname{lca}_T(c, d) = u$ and $c \prec_T v$ imply that $d \in L(T(u)) \setminus L(T(v))$ and thus $\sigma(d) \in \sigma(L(T(u)) \setminus L(T(v)))$. If $(c, d) \in E(\vec{G}(T, \sigma))$, then $\sigma(d) \notin \sigma(L(T(v)))$ must hold. Therefore, $(c, d)$ is still an arc after contraction of $e$. For the case $(c, d) \notin E(\vec{G}(T, \sigma))$, assume for contradiction $(c, d) \in E(\vec{G}(T_e, \sigma))$. Then $(c, d) \notin E(\vec{G}(T, \sigma))$ implies that there must be a vertex $d'$ with $\sigma(d') = \sigma(d)$ and $\operatorname{lca}_T(c, d') \preceq_T v \prec_T u = \operatorname{lca}_T(c, d)$. In particular, $d' \in L(T(v))$ can be chosen such that $\operatorname{lca}_T(c, d')$ is farthest away from $v$ and thus $(c, d') \in E(\vec{G}(T, \sigma))$. Now, $\operatorname{lca}_T(c, d') \preceq_T v$ and $(c, d) \in E(\vec{G}(T_e, \sigma))$ imply that $\operatorname{lca}_{T_e}(c, d') = w_e = \operatorname{lca}_{T_e}(c, d)$, which is only possible if $\operatorname{lca}_T(c, d') = v$. In summary, we found an arc $(c, d') \in E(\vec{G}(T, \sigma))$ with $\operatorname{lca}_T(c, d') = v$ and $\sigma(d') \in \sigma(L(T(u)) \setminus L(T(v)))$; a contradiction to our assumption. Hence, in Case (b) we have $(c, d) \in E(\vec{G}(T, \sigma))$ if and only if $(c, d) \in E(\vec{G}(T_e, \sigma))$.

Case (c). Since $\operatorname{lca}_T(c, d) \succ_T u$, it is again easy to see that, for every $d'$ with $\sigma(d') = \sigma(d)$, $\operatorname{lca}_T(c, d') \prec_T \operatorname{lca}_T(c, d)$ if and only if $\operatorname{lca}_{T_e}(c, d') \prec_{T_e} \operatorname{lca}_{T_e}(c, d)$ and thus $(c, d) \in E(\vec{G}(T, \sigma))$ if and only if $(c, d) \in E(\vec{G}(T_e, \sigma))$.

In summary, we have $(c, d) \in E(\vec{G}(T, \sigma))$ if and only if $(c, d) \in E(\vec{G}(T_e, \sigma))$ for all $c, d \in L(T)$. Thus, $e$ is redundant.

As a consequence of Lemma 2.10, we obtain

Corollary 2.11. Let $(T, \sigma)$ be a leaf-colored tree explaining $(\vec{G}, \sigma)$ and $uv$ an inner edge of $T$ with $v \prec_T u$. If $\sigma(L(T(v))) \cap \sigma(L(T(v'))) = \emptyset$ for every $v' \in \operatorname{child}_T(u) \setminus \{v\}$, then $uv$ is redundant in $T$ (w.r.t. $(\vec{G}, \sigma)$).

Proof. If there is an arc $(a, b) \in E(\vec{G})$ with $\operatorname{lca}_T(a, b) = v$, then $b \in L(T(v))$ and hence $\sigma(b) \notin \sigma(L(T(u)) \setminus L(T(v))) = \bigcup_{v' \in \operatorname{child}_T(u) \setminus \{v\}} \sigma(L(T(v')))$, because $\sigma(L(T(v))) \cap \sigma(L(T(v'))) = \emptyset$ for every $v' \in \operatorname{child}_T(u) \setminus \{v\}$. By Lemma 2.10, the inner edge $uv$ is redundant.

Finally, we show that redundant edges can be contracted in arbitrary order, similar to [12, Lemma 6 & Cor. 2]. To this end, we first prove a more general statement.

Lemma 2.12.
If $T_A$ is obtained from $T$ by contracting all edges in $A \subseteq E(T)$, then $\vec{G}(T, \sigma) \subseteq \vec{G}(T_A, \sigma)$.

Proof. Let $(x, y)$ be an arc in $\vec{G}(T, \sigma)$. This implies that there is no $y'$ with $\sigma(y') = \sigma(y)$ such that $\operatorname{lca}_T(x, y') \prec_T \operatorname{lca}_T(x, y)$. It is easy to verify that the latter is still true after contraction of an arbitrary edge $e$, i.e., there is no $y'$ with $\sigma(y') = \sigma(y)$ such that $\operatorname{lca}_{T_e}(x, y') \prec_{T_e} \operatorname{lca}_{T_e}(x, y)$. Hence, $(x, y)$ is an arc in $\vec{G}(T_e, \sigma)$. Now consider the subsets $A_1 \subset A_2 \subset \dots \subset A_{|A|} = A$ where each $|A_i| = i$, $1 \le i \le |A|$. The argument above implies $\vec{G}(T, \sigma) \subseteq \vec{G}(T_{A_1}, \sigma) \subseteq \dots \subseteq \vec{G}(T_A, \sigma)$, which completes the proof.

Lemma 2.13.
Let $A$ and $B$ be disjoint sets of redundant edges in $(T, \sigma)$ w.r.t. $(\vec{G}, \sigma)$ and denote by $T_A$ the tree obtained by contraction of all edges in $A$ in arbitrary order. Then $B$ is a set of redundant edges in $T_A$ w.r.t. $\vec{G}(T_A, \sigma) = \vec{G}(T, \sigma)$.

Proof. By Lemma 2.12, contraction of any inner edge $e = uv \in E(T)$ never leads to a loss of arcs in the BMG $(\vec{G}, \sigma) = \vec{G}(T, \sigma)$. Furthermore, the redundant edges in $T$ w.r.t. $(\vec{G}, \sigma)$ are completely characterized by Lemma 2.10. Thm. 8 in [12] states that by contraction of all redundant edges (in an arbitrary order), one obtains the unique least resolved tree $(T^*, \sigma)$ of $(\vec{G}, \sigma)$. As argued above, no arc of $\vec{G}(T, \sigma)$ can be lost in the stepwise contraction of redundant edges. Together with $\vec{G}(T, \sigma) = \vec{G}(T^*, \sigma) = (\vec{G}, \sigma)$ this implies $\vec{G}(T_A, \sigma) = (\vec{G}, \sigma)$. Since by assumption $A \cap B = \emptyset$ and $A \cup B$ is a set of redundant edges w.r.t. $(\vec{G}, \sigma)$, we have $(T_A)_B = T_{A \cup B}$ and $\vec{G}(T_A, \sigma) = (\vec{G}, \sigma) = \vec{G}(T_{A \cup B}, \sigma) = \vec{G}((T_A)_B, \sigma)$. Hence, $B$ is a set of redundant edges in $T_A$ w.r.t. $\vec{G}(T_A, \sigma)$.

An evolutionary scenario extends the map $\sigma : L(T) \to L(S)$ to an embedding of the gene tree into the species tree. It (implicitly) describes different types of evolutionary events: speciations, gene duplications, and gene losses. In this contribution we do not consider other types of events such as horizontal gene transfer. Gene losses do not appear explicitly since $L(T)$ only contains extant genes. Inner vertices in the gene tree $T$ that designate speciations have their correspondence in inner vertices of the species tree. In contrast, gene duplications occur independently of speciations and thus belong to edges of the species tree. The embedding of $T$ into $S$ is formalized by

Definition 2.14 (Reconciliation Map).
Let $S = (W, F)$ and $T = (V, E)$ be two planted phylogenetic trees and let $\sigma : L(T) \to L(S)$ be a surjective map. A reconciliation from $(T, \sigma)$ to $S$ is a map $\mu : V \to W \cup F$ satisfying

(R0) Root Constraint. $\mu(x) = 0_S$ if and only if $x = 0_T$.
(R1) Leaf Constraint. If $x \in L(T)$, then $\mu(x) = \sigma(x)$.
(R2) Ancestor Preservation. If $x \prec_T y$, then $\mu(x) \preceq_S \mu(y)$.
(R3) Speciation Constraints. Suppose $\mu(x) \in W$.
  (i) $\mu(x) = \operatorname{lca}_S(\mu(v'), \mu(v''))$ for at least two distinct children $v', v''$ of $x$ in $T$.
  (ii) $\mu(v')$ and $\mu(v'')$ are incomparable in $S$ for any two distinct children $v'$ and $v''$ of $x$ in $T$.

Several alternative definitions of reconciliation maps for duplication/loss scenarios have been proposed in the literature, many of which have been shown to be equivalent. This type of reconciliation map has been established in [13]. Moreover, it has been shown in [13] that the axiom set used here is equivalent to axioms that are commonly used in the literature, see e.g. [7, 15, 19, 36, 40, 50], and the references therein. Without any further constraints, Def. 2.14 gives rise to a well-known result:
Lemma 2.15. [13, Lemma 3] For every tree $(T, \sigma)$ there is a reconciliation map $\mu$ to any species tree $S$ with leaf set $L(S) = \sigma(L(T))$.

The reconciliation map $\mu$ from $(T, \sigma)$ to $S$ determines the types of evolutionary events in $T$. This can be formalized by associating an event labeling with the vertices of $T$. We use the notation introduced in [13]:

Definition 2.16. Given a reconciliation map $\mu$ from $(T, \sigma)$ to $S$, the event labeling on $T$ (determined by $\mu$) is the map $t_\mu : V(T) \to \{\circledcirc, \odot, \bullet, \square\}$ given by:

$$
t_\mu(u) =
\begin{cases}
\circledcirc & \text{if } u = 0_T, \text{ i.e., } \mu(u) = 0_S \text{ (root)} \\
\odot & \text{if } u \in L(T), \text{ i.e., } \mu(u) \in L(S) \text{ (leaf)} \\
\bullet & \text{if } \mu(u) \in V^0(S) \text{ (speciation)} \\
\square & \text{else, i.e., } \mu(u) \in E(S) \text{ (duplication)}
\end{cases}
$$

The following result is a simple but useful consequence of combining the axioms of the reconciliation map with the event labeling of Def. 2.16.

Lemma 2.17. [13, Lemma 2] Let $\mu$ be a reconciliation map from $(T, \sigma)$ to a tree $S$ and suppose that $u \in V(T)$ is a vertex with $\mu(u) \in V^0(S)$ and thus $t_\mu(u) = \bullet$. Then, $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) = \emptyset$ for any two distinct $v_1, v_2 \in \operatorname{child}_T(u)$.

We will regularly make use of the observation that, by contraposition of Lemma 2.17, $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$ for two distinct $v_1, v_2 \in \operatorname{child}_T(u)$ implies that $\mu(u) \in E(S)$, and thus $t_\mu(u) = \square$.

Lemma 2.17 suggests to define event-labeled trees as trees $(T, t)$ endowed with a map $t : V(T) \to \{\circledcirc, \odot, \bullet, \square\}$ such that $t(0_T) = \circledcirc$ and $t(u) = \odot$ for all $u \in L(T)$. In [13], Lemma 2.17 also served as a motivation for

Definition 2.18.
Let $(T, \sigma)$ be a leaf-colored tree. The extremal event labeling of $T$ is the map $\hat{t}_T : V(T) \to \{\circledcirc, \odot, \bullet, \square\}$ defined for $u \in V(T)$ by

$$
\hat{t}_T(u) =
\begin{cases}
\circledcirc & \text{if } u = 0_T \\
\odot & \text{if } u \in L(T) \\
\square & \text{if there are two children } v_1, v_2 \in \operatorname{child}_T(u) \text{ such that } \sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset \\
\bullet & \text{otherwise}
\end{cases}
$$

The extremal event labeling $\hat{t}_T$ is completely defined by $(T, \sigma)$ and, in contrast to the event labeling in Def. 2.16, does not depend on a reconciliation map. However, there is no guarantee that there always exists a reconciliation map $\mu$ from $(T, \sigma)$ to any species tree $S$ such that $t_\mu = \hat{t}_T$, cf. [13, Fig. 2] for a counterexample.

The event labeling on $T$ defines the orthology graph.

Definition 2.19.
The orthology graph $\Theta(T, t)$ of an event-labeled tree $(T, t)$ has vertex set $L(T)$ and edges $uv \in E(\Theta)$ if and only if $t(\operatorname{lca}_T(u, v)) = \bullet$.

The orthology graph is often referred to as the orthology relation. Orthology graphs coincide with a well-known graph class:
Theorem 2.20. [22, Cor. 4] A graph $G$ is an orthology graph for some event-labeled tree $(T,t)$, i.e., $G = \Theta(T,t)$, if and only if $G$ is a cograph.

One of many equivalent characterizations of cographs identifies them with the graphs that do not contain an induced path $P_4$ on four vertices [4].

The orthology graph is a subgraph of the BMG for any given reconciliation map connecting a gene tree with a species tree.

Theorem 2.21. [13, Lemma 4 & 5] Let $(T,\sigma)$ be a leaf-colored tree and $\mu$ a reconciliation map from $(T,\sigma)$ to some species tree $S$. Then $\Theta(T,t_\mu) \subseteq \Theta(T,\widehat{t}_T) \subseteq \vec{G}(T,\sigma)$.

In particular, $t_\mu(v) = \bullet$ implies $\widehat{t}_T(v) = \bullet$ for any reconciliation map. By contraposition, therefore, if $\widehat{t}_T(v) = \square$ then $t_\mu(v) = \square$ for all possible reconciliation maps $\mu$ from $(T,\sigma)$ to any species tree $S$. A crucial implication of Thm. 2.21 is that edges in a BMG $\vec{G}(T,\sigma)$ always correspond to either correct orthologous pairs of genes or false-positive orthology assignments. Hence, $\vec{G}(T,\sigma)$ never contains false-negative orthology assignments.

Denote by $(\tilde{T},\tilde{t},\sigma)$ the true leaf-colored and event-labeled gene tree and let $(\vec{G},\sigma)$ be a BMG estimated for the same data. An edge $xy$ of $(\vec{G},\sigma)$, or equivalently of the corresponding RBMG $(G,\sigma)$, is a false-positive orthology assignment if $xy \in E(G)$ but $xy \notin E(\Theta(\tilde{T},\tilde{t},\sigma))$; it is a false-negative orthology assignment if $xy \notin E(G)$ but $xy \in E(\Theta(\tilde{T},\tilde{t},\sigma))$. There are two distinct sources of error: inaccuracies in the assignment of best matches [46] and limits in the reconstruction of $(\tilde{T},\tilde{t},\sigma)$ from Best Match Graphs [13]. In this contribution, we are only concerned with the latter. Observation 1 of [13], see also Thm.
2.21 above, implies that for evolutionary scenarios that involve only speciations, gene duplications, and gene losses, there are no false-negative orthology assignments. Our task at hand therefore reduces to understanding the false-positive orthology assignments.

We first note that these cannot be avoided altogether. The simplest example, Fig. 1, comprises a gene duplication and a subsequent speciation with complementary gene losses in the descendant lineages such that each paralog survives in only one of them. In this situation $xy$ is a reciprocal best match. If there are no other descendants that harbor genes witnessing the duplication event, then the framework of best matches provides no information to recognize $xy$ as a false-positive assignment.

Figure 1: Complementary gene loss (left) that is not witnessed by any other species. In particular, two different true histories $(\tilde{T}_1,\tilde{t}_1,\sigma)$ and $(\tilde{T}_2,\tilde{t}_2,\sigma)$ produce the same BMG, $\vec{G}(\tilde{T}_1,\sigma) = \vec{G}(\tilde{T}_2,\sigma)$, whereas the only edge $xy$ is a true false-positive in the first history.

$(T,\sigma)$- and Unambiguous False-Positive Edges

In order to study false-positive orthology assignments in more detail, we assume that we have a tree $(T,\sigma)$ that explains the BMG $(\vec{G},\sigma)$. Note that we do not make the assumption that $(T,\sigma)$ is least resolved.

Definition 3.1 ($(T,\sigma)$-false-positive). Let $(T,\sigma)$ be a tree explaining the BMG $(\vec{G},\sigma)$. An edge $xy$ in $\vec{G}$ is called $(T,\sigma)$-false-positive, or $(T,\sigma)$-fp for short, if for every reconciliation map $\mu$ from $(T,\sigma)$ to some species tree $S$ we have $t_\mu(\mathrm{lca}_T(x,y)) = \square$, i.e., $\mu(\mathrm{lca}_T(x,y)) \in E(S)$.

In other words, $xy$ is called $(T,\sigma)$-fp whenever $x$ and $y$ cannot be orthologous w.r.t. any possible reconciliation $\mu$ from $(T,\sigma)$ to any species tree. Interestingly, $(T,\sigma)$-fp edges can be identified without considering reconciliation maps explicitly.

Lemma 3.2.
Let $(\vec{G},\sigma)$ be a BMG, $xy$ be an edge in $\vec{G}$ and $(T,\sigma)$ be a tree that explains $(\vec{G},\sigma)$. Then, the following statements are equivalent:
1. The edge $xy$ is $(T,\sigma)$-fp.
2. There are two children $v_1$ and $v_2$ of $\mathrm{lca}_T(x,y)$ such that $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$.
3. For the extremal labeling $\widehat{t}_T$ of $(T,\sigma)$ it holds that $\widehat{t}_T(\mathrm{lca}_T(x,y)) = \square$.

Proof. (2) implies (1). Suppose that there are two children $v_1$ and $v_2$ of $\mathrm{lca}_T(x,y)$ such that $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$. By contraposition of Lemma 2.17, $\mu(\mathrm{lca}_T(x,y)) \in E(S)$ and thus, $t_\mu(\mathrm{lca}_T(x,y)) = \square$ for all possible reconciliation maps $\mu$ from $(T,\sigma)$ to any species tree $S$. Hence, $xy$ is $(T,\sigma)$-fp.

(1) implies (2). By contraposition, let $v = \mathrm{lca}_T(x,y)$ and suppose that for all distinct children $v_i, v_j \in \mathrm{child}(v) = \{v_1,\dots,v_k\}$, $k \geq 2$, we have $\sigma(L(T(v_i))) \cap \sigma(L(T(v_j))) = \emptyset$. In the following, we show that there is a species tree $S$ and a reconciliation map $\mu$ from $(T,\sigma)$ to $S$ such that $t_\mu(\mathrm{lca}_T(x,y)) = \bullet$, which implies that $xy$ is not $(T,\sigma)$-fp.

We construct the species tree $S$ as follows: $S$ has root edge $0_S\rho_S$. Now add $k$ children $u_1,\dots,u_k$ to $\rho_S$. For each of these children $u_i$ with $|\sigma(L(T(v_i)))| > 1$, we add a leaf $t$ for every color $t \in \sigma(L(T(v_i)))$ and the edge $u_i t$. Any other $u_i$ is considered to be a leaf in $S$, and we identify $u_i$ with the single element in $\sigma(L(T(v_i)))$. Furthermore, add for all $t \in \sigma(L(T)) \setminus \sigma(L(T(v)))$ a leaf $t$ that is adjacent to $\rho_S$. Since the color sets $\sigma(L(T)) \setminus \sigma(L(T(v))), \sigma(L(T(v_1))),\dots,\sigma(L(T(v_k)))$ are pairwise disjoint, $S$ is well-defined, and, by construction, a planted phylogenetic tree.

To construct a reconciliation map we put (i) $\mu(0_T) = 0_S$; (ii) $\mu(x) = \sigma(x)$ for all $x \in L(T)$; (iii) $\mu(v) = \rho_S$; (iv) $\mu(w) = 0_S\rho_S$ for all $w \in V(T \setminus T(v))$; and (v) $\mu(w) = \rho_S u_i$ for all inner vertices $w$ of $T(v_i)$. By Conditions (i) and (ii), the Axioms (R0) and (R1) are satisfied, respectively. By Condition (v), we have $\mu(v_i) = \rho_S u_i$ if $v_i$ is an inner vertex. Otherwise, $v_i$ is a leaf and $|\sigma(L(T(v_i)))| = 1$. Therefore, $\mu(v_i) = \sigma(v_i) = u_i$ by (ii) and by construction. It is easy to verify that $\mu$ satisfies (R2). A sketch of the construction of the species tree $S$ and the reconciliation map $\mu$ is provided in Fig. 2.

Figure 2: Visualization of the construction of a species tree $S$ and reconciliation map $\mu$ as described in the proof of Lemma 3.2. Note that, in the example, $v_k$ is already a leaf in the gene tree $T$. Hence, the corresponding $u_k$ is also a leaf since $|\sigma(L(T(v_k)))| = 1$. Moreover, note that for $x \in L(T) \setminus L(T(v))$, it is possible that $\mu(x) = u_j$ or $\mu(x) = t$ with $t \in \mathrm{child}_S(u_j)$ for some $u_j$.

The only vertex of $T$ that is mapped to a vertex in $S$ is $v$. Hence, it remains to show that $\mu(v) = \rho_S \in V(S)$ satisfies (R3). Note that for every two distinct children $v_i, v_j$ of $v$ we have $\mu(v_i) \in \{\rho_S u_i, u_i\}$ and $\mu(v_j) \in \{\rho_S u_j, u_j\}$. In any case, $\mu(v_i)$ and $\mu(v_j)$ are incomparable in $S$. Hence, (R3.ii) is satisfied. In particular, $\mu(v) = \rho_S = \mathrm{lca}_S(\mu(v_i),\mu(v_j))$ for all distinct $v_i, v_j \in \mathrm{child}(v)$. Hence, (R3.i) is satisfied. In summary, $\mu$ is a reconciliation map from $(T,\sigma)$ to $S$. Since $\mu(v) = \rho_S \in V(S)$, we have $t_\mu(v) = \bullet$.

Statements (2) and (3) are equivalent by definition of the extremal event labeling.

Lemma 3.2 implies that the $(T,\sigma)$-fp property can be verified in polynomial time for any given gene tree $(T,\sigma)$.

Definition 3.3 (Unambiguous false-positive). Let $(\vec{G},\sigma)$ be a BMG. An edge $xy$ in $\vec{G}$ is called unambiguous false-positive (u-fp) if for all trees $(T,\sigma)$ that explain $(\vec{G},\sigma)$ the edge $xy$ is $(T,\sigma)$-fp.

Hence, if an edge $xy$ in $\vec{G}$ is u-fp, then it is in particular $(T,\sigma)$-fp in the true history that explains $\vec{G}$. Thus, u-fp edges are always "correct" false-positives.

Clearly, not all "correct" false-positives are covered by this definition, since it may be possible that, for an edge $xy$ in $\vec{G}$, we have $t_\mu(\mathrm{lca}_T(x,y)) = \square$ for the true gene tree and the true species tree, but $xy$ is not $(T',\sigma)$-fp for some gene tree $(T',\sigma)$ possibly different from $(T,\sigma)$. One of the simplest examples is shown in Fig. 1, assuming that $(\tilde{T},\sigma)$ is the "true" history. Since $t_\mu(\mathrm{lca}_{\tilde{T}}(x,y)) = \bullet$ may be possible (Fig. 1, right), the edge $xy$ is not $(\tilde{T},\sigma)$-fp and therefore not u-fp.
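The polynomial-time test behind Lemma 3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the tree encoding (a `children` dictionary plus a `sigma` dictionary for leaf colors), all function names, and the toy tree are hypothetical, and the planted root $0_T$ is left out. The sketch computes the color sets $\sigma(L(T(v)))$, derives the extremal event labeling, and reports an edge as $(T,\sigma)$-fp exactly when its last common ancestor is labeled as a duplication. It does not check that $xy$ is actually an edge of the BMG.

```python
from itertools import combinations


def color_sets(children, sigma, root):
    """Map every vertex v to sigma(L(T(v))), the set of colors below v."""
    colors = {}

    def visit(v):
        kids = children.get(v, [])
        if not kids:                      # v is a leaf
            colors[v] = {sigma[v]}
            return
        colors[v] = set()
        for c in kids:
            visit(c)
            colors[v] |= colors[c]

    visit(root)
    return colors


def extremal_labeling(children, sigma, root):
    """Label each vertex as 'leaf', 'duplication' or 'speciation'.

    An inner vertex is a duplication iff two of its children share a color.
    (Assumption: the planted root 0_T is not part of the encoding.)
    """
    colors = color_sets(children, sigma, root)
    label = {}
    for v in colors:
        kids = children.get(v, [])
        if not kids:
            label[v] = "leaf"
        elif any(colors[a] & colors[b] for a, b in combinations(kids, 2)):
            label[v] = "duplication"
        else:
            label[v] = "speciation"
    return label


def lca(parent, x, y):
    """Last common ancestor of x and y, given a child -> parent map."""
    ancestors = set()
    while x is not None:
        ancestors.add(x)
        x = parent.get(x)
    while y not in ancestors:
        y = parent[y]
    return y


def is_T_sigma_fp(children, sigma, root, x, y):
    """Statement (3) of Lemma 3.2: the lca of x and y is a duplication."""
    parent = {c: p for p, kids in children.items() for c in kids}
    label = extremal_labeling(children, sigma, root)
    return label[lca(parent, x, y)] == "duplication"


# Hypothetical toy tree: color C occurs below both children of the root.
children = {"r": ["a", "b"], "a": ["x", "z1"], "b": ["y", "z2"]}
sigma = {"x": "A", "z1": "C", "y": "B", "z2": "C"}
```

In this toy tree the two children of the root share the color `C`, so `is_T_sigma_fp(children, sigma, "r", "x", "y")` evaluates to `True`, whereas the pair `x`, `z1` meets at a vertex whose children have disjoint color sets.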
Color-intersection $S^{\cap}$

In this subsection, we introduce the color-intersection $S^{\cap}$, which can be used to identify false-positive edges, and establish its most salient properties. Given a gene tree $(T,\sigma)$ and a pair of distinct leaves $x, y \in L(T)$ we denote by $v_x, v_y \in \mathrm{child}_T(\mathrm{lca}_T(x,y))$ the unique children of the last common ancestor of $x$ and $y$ for which $x \preceq_T v_x$ and $y \preceq_T v_y$. That is, $T(v_x)$ and $T(v_y)$ are the subtrees of $T$ rooted in the children of $\mathrm{lca}_T(x,y)$ with $x \in L(T(v_x))$ and $y \in L(T(v_y))$. The set

$$S^{\cap}_T(x,y) := \sigma(L(T(v_x))) \cap \sigma(L(T(v_y))) \qquad (1)$$

contains the colors, i.e., species, that are common to both subtrees. Lemma 2.4 immediately implies

Corollary 3.4.
Let $xy$ be an edge in a BMG $(\vec{G},\sigma)$. Then $\sigma(\{x,y\}) \cap S^{\cap}_T(x,y) = \emptyset$ for all trees $(T,\sigma)$ that explain $(\vec{G},\sigma)$.

The following result shows that the color-intersection of a given edge in a BMG $(\vec{G},\sigma)$ in fact does not depend on the tree representation of $(\vec{G},\sigma)$.

Lemma 3.5.
Let $(\vec{G},\sigma)$ be a BMG and $(T^*,\sigma)$ the corresponding unique least resolved tree explaining $(\vec{G},\sigma)$. Then, for each tree $(T,\sigma)$ that explains $(\vec{G},\sigma)$, every edge $xy$ in $(\vec{G},\sigma)$ satisfies $S^{\cap}_{T^*}(x,y) = S^{\cap}_T(x,y)$. Thus, in particular, $S^{\cap}_{T^*}(x,y) \neq \emptyset$ if and only if $S^{\cap}_T(x,y) \neq \emptyset$.

Figure 3: The BMG $(\vec{G},\sigma)$ shown on the right is explained by both $(T_1,\sigma)$, which is the unique least resolved tree for $(\vec{G},\sigma)$, and $(T_2,\sigma)$. The vertices labeled $\square$ must be duplications due to Lemma 2.17, while the vertices labeled "?" could be either duplications or speciations. The edges $xz$, $x'z$ and $yz$ are $(T_1,\sigma)$-fp but not $(T_2,\sigma)$-fp (cf. Lemma 3.2). Thus, none of the edges $xz$, $x'z$ and $yz$ is u-fp.

Proof.
Let $(T,\sigma)$ be an arbitrary tree that explains $(\vec{G},\sigma)$. Moreover, let $xy$ be an edge in $\vec{G}$ and denote by $v_x, v_y \in \mathrm{child}_T(\mathrm{lca}_T(x,y))$ the unique children with $x \preceq_T v_x$ and $y \preceq_T v_y$. Analogously, $v^*_x, v^*_y \in \mathrm{child}_{T^*}(\mathrm{lca}_{T^*}(x,y))$ are the unique children with $x \preceq_{T^*} v^*_x$ and $y \preceq_{T^*} v^*_y$.

First, we show that $t \in S^{\cap}_{T^*}(x,y)$ implies $t \in S^{\cap}_T(x,y)$. Since $(T,\sigma)$ explains $(\vec{G},\sigma)$, we apply Thm. 2.6 to conclude that $T$ is a refinement of $T^*$ and thus, $\mathcal{C}(T^*) \subseteq \mathcal{C}(T)$. Therefore, $L(T^*(\mathrm{lca}_{T^*}(x,y)))$, $L(T^*(v^*_x))$ and $L(T^*(v^*_y))$ are contained in $\mathcal{C}(T)$. This implies that there must be vertices $u$, $w_x$, and $w_y$ in $T$ with $L(T(u)) = L(T^*(\mathrm{lca}_{T^*}(x,y)))$, $L(T(w_x)) = L(T^*(v^*_x))$ and $L(T(w_y)) = L(T^*(v^*_y))$. Note that $L(T^*(v^*_x)) \cap L(T^*(v^*_y)) = \emptyset$, and thus $L(T(w_x)) \cap L(T(w_y)) = \emptyset$. In particular, $w_x$ and $w_y$ are incomparable in $T$. Moreover, $u = \mathrm{lca}_T(x,y) = \mathrm{lca}_T(w_x,w_y)$, thus we have $w_x \preceq_T v_x$ and $w_y \preceq_T v_y$. Therefore, $L(T^*(v^*_x)) \subseteq L(T(v_x))$ and $L(T^*(v^*_y)) \subseteq L(T(v_y))$. Consequently, $t \in S^{\cap}_{T^*}(x,y)$ implies $t \in S^{\cap}_T(x,y)$.

Now, we show that $t \in S^{\cap}_T(x,y)$ implies $t \in S^{\cap}_{T^*}(x,y)$. Let $t \in S^{\cap}_T(x,y) \neq \emptyset$. In this case, $t \in \sigma(L(T(v_x)))$ and we can choose a vertex $z_1 \in L(T(v_x))$ such that $\sigma(z_1) = t$ and $\mathrm{lca}_T(x,z_1)$ is as far away as possible from $v_x$ compared to all $\mathrm{lca}_T(x,z')$ with $z' \in L[t]$, i.e., $\mathrm{lca}_T(x,z_1) \preceq_T \mathrm{lca}_T(x,z')$ for all $z' \in L[t]$. Thus, $(x,z_1) \in E(\vec{G})$. An analogous argument ensures that there is a vertex $z_2 \in L(T(v_y))$ such that $\sigma(z_2) = t$ and $(y,z_2) \in E(\vec{G})$. Clearly, $\mathrm{lca}_T(x,z_2) = \mathrm{lca}_T(x,y) = \mathrm{lca}_T(y,z_1)$ and thus $\mathrm{lca}_T(x,z_1) \preceq_T v_x \prec_T \mathrm{lca}_T(x,z_2)$, which in turn implies that $(x,z_2) \notin E(\vec{G})$. Since $(x,z_1) \in E(\vec{G})$ and $(x,z_2) \notin E(\vec{G})$, we obtain the informative triple $xz_1|z_2$ for $\vec{G}$. Analogously, $yz_2|z_1$ is an informative triple for $\vec{G}$.

Lemma 2.9 and the fact that $T^*$ explains $(\vec{G},\sigma)$ imply that there are distinct vertices $v_1, v_2 \in \mathrm{child}_{T^*}(\mathrm{lca}_{T^*}(x,y))$ such that $x, z_1 \preceq_{T^*} v_1$ and $y, z_2 \preceq_{T^*} v_2$. Since $t = \sigma(z_1) = \sigma(z_2)$, we have $t \in S^{\cap}_{T^*}(x,y)$.

Finally, the fact that $t \in S^{\cap}_{T^*}(x,y)$ if and only if $t \in S^{\cap}_T(x,y)$ implies both $S^{\cap}_{T^*}(x,y) = S^{\cap}_T(x,y)$ and the statement that $S^{\cap}_{T^*}(x,y) \neq \emptyset$ if and only if $S^{\cap}_T(x,y) \neq \emptyset$.

Remark 1.
By Lemma 3.5, we have $S^{\cap}_T(x,y) = S^{\cap}_{T^*}(x,y)$ for every tree $(T,\sigma)$ explaining a BMG $(\vec{G},\sigma)$ with corresponding least resolved tree $(T^*,\sigma)$. It is thus sufficient to consider $S^{\cap}_{T^*}(x,y)$. We will therefore drop the explicit reference to the tree and simply write $S^{\cap}(x,y)$. We can verify in polynomial time whether or not $S^{\cap}(x,y) = \emptyset$ because the least resolved tree $(T^*,\sigma)$ explaining $(\vec{G},\sigma)$ can be computed in polynomial time.

Proposition 3.6.
Every edge $xy$ in a BMG $(\vec{G},\sigma)$ with $S^{\cap}(x,y) \neq \emptyset$ is u-fp.

Proof. By Lemma 3.5 and Remark 1, $S^{\cap}(x,y) \neq \emptyset$ if and only if $S^{\cap}_T(x,y) \neq \emptyset$ for all trees $(T,\sigma)$ that explain $(\vec{G},\sigma)$. By contraposition of Lemma 2.17, $\mu(\mathrm{lca}_T(x,y)) \in E(S)$ and thus, $t_\mu(\mathrm{lca}_T(x,y)) = \square$ for all trees $(T,\sigma)$ that explain $(\vec{G},\sigma)$ and all reconciliation maps $\mu$ from $(T,\sigma)$ to any species tree $S$. Hence, $xy$ is u-fp.

As we shall see later, the converse of Prop. 3.6 is not always satisfied (cf. also Fig. 5). An immediate consequence of Prop. 3.6 is the following:

Corollary 3.7.
An edge $xy$ in a BMG $\vec{G}(T,\sigma)$ with $S^{\cap}(x,y) \neq \emptyset$ is $(T,\sigma)$-fp.

The converse, however, is not true in general. For an example consider the unique least resolved tree $(T_1,\sigma)$ that explains the BMG $(\vec{G},\sigma)$ in Fig. 3. Here, the edge $xz$ is $(T_1,\sigma)$-fp (cf. Lemma 3.2) but $S^{\cap}(x,z) = \emptyset$. Although not true in general, the converses of Prop. 3.6 and Cor. 3.7 do hold for the special case of binary trees.

Lemma 3.8.
Let $xy$ be an edge in $\vec{G}(T,\sigma)$ and suppose $\mathrm{lca}_T(x,y)$ is a binary vertex. Then, the following three statements are equivalent:
1. The edge $xy$ is $(T,\sigma)$-fp.
2. $S^{\cap}(x,y) \neq \emptyset$.
3. The edge $xy$ is u-fp.

Proof. (1) implies (2). Suppose $xy$ is $(T,\sigma)$-fp and put $v := \mathrm{lca}_T(x,y)$. Since $v$ is binary, it has precisely two children $v_1$ and $v_2$. In particular, $v = \mathrm{lca}_T(x,y)$ implies that $x \preceq_T v_i$ and $y \preceq_T v_j$ for distinct $i, j \in \{1,2\}$. By Lemma 3.2, the two children $v_1$ and $v_2$ of $v$ satisfy $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$. By Lemma 3.5 and Remark 1, we have $S^{\cap}(x,y) \neq \emptyset$.

(2) implies (3). If $S^{\cap}(x,y) \neq \emptyset$, we can apply Prop. 3.6 to conclude that $xy$ is u-fp.

(3) implies (1). By definition, if $xy$ is u-fp, then it is in particular also $(T,\sigma)$-fp.

Theorem 3.9.
Let $(\vec{G},\sigma)$ be a BMG that is explained by a binary tree $(T,\sigma)$. Then $xy$ is $(T,\sigma)$-fp if and only if $xy$ is u-fp.

Proof. For every edge $xy$ in $\vec{G}$ the last common ancestor $\mathrm{lca}_T(x,y)$ is binary. Now apply Lemma 3.8.

Thm. 3.9 implies that all u-fp edges can be detected in a BMG that is explained by a known binary gene tree. However, not all BMGs $(\vec{G},\sigma)$ can be explained by a binary tree, as e.g. the BMG in Fig. 6(A). Thm. 3.9 does not generalize to the non-binary case, and $S^{\cap}(x,y)$ is not sufficient to identify all u-fp edges. Furthermore, it is not difficult to find non-binary trees in which $(T,\sigma)$-fp and u-fp edges are not the same: As shown in Fig. 3, the edge $xz$ is $(T_1,\sigma)$-fp but not $(T_2,\sigma)$-fp according to Lemma 3.2. Since both trees explain the same BMG, the edge $xz$ is not u-fp.

$S^{\cap}(x,y) \neq \emptyset$ and Quartets

Since every orthology graph is a cograph (Thm. 2.20) and thus free of induced $P_4$s, every induced $P_4$ in the RBMG necessarily contains a false-positive orthology assignment. The subgraphs of the BMG spanned by a $P_4$ in its symmetric part (i.e., the RBMG) are known as quartets. The quartets on three colors of a BMG $(\vec{G},\sigma)$ fall into three distinct classes depending on the coloring and the additional, non-symmetric edges (cf. [14, Lemma 32]). We write $\langle abcd \rangle$ or, equivalently, $\langle dcba \rangle$ for an induced $P_4$ with edges $ab$, $bc$, and $cd$.

Definition 3.10 (Good, bad, and ugly quartets). Let $(\vec{G},\sigma)$ be a BMG with symmetric part $(G,\sigma)$ and vertex set $L$, and let $Q := \{x,y,z,z'\} \subseteq L$ with $x \in L[r]$, $y \in L[s]$, and $z, z' \in L[t]$. The set $Q$, resp., the induced subgraph $(\vec{G}[Q],\sigma_{|Q})$ is
a good quartet if (i) $\langle zxyz' \rangle$ is an induced $P_4$ in $(G,\sigma)$ and (ii) $(z,y),(z',x) \in E(\vec{G})$ and $(y,z),(x,z') \notin E(\vec{G})$,
a bad quartet if (i) $\langle zxyz' \rangle$ is an induced $P_4$ in $(G,\sigma)$ and (ii) $(y,z),(x,z') \in E(\vec{G})$ and $(z,y),(z',x) \notin E(\vec{G})$,
an ugly quartet if $\langle zxz'y \rangle$ is an induced $P_4$ in $(G,\sigma)$.
The edge $xy$ in a good quartet $\langle zxyz' \rangle$ is its middle edge. The edge $zx$ of an ugly quartet $\langle zxz'y \rangle$ or a bad quartet $\langle zxyz' \rangle$ is called its first edge. First edges in ugly quartets are uniquely determined due to the colors. In bad quartets, this is not the case and therefore, the edge $yz'$ in $\langle zxyz' \rangle$ is a first edge as well.

An RBMG never contains induced $P_4$s on two colors [14, Observation 5]. This, in particular, implies that for the induced $P_4$s in Def. 3.10 the colors $r, s, t$ must be pairwise distinct. Note that BMGs can also contain induced $P_4$s on four colors. However, these are of no further interest for our purpose.

Good quartets are the characteristic subgraphs that appear in a BMG whenever a complementary gene loss (as shown in Fig. 1) is "witnessed" by a third species ($\sigma(z) = \sigma(z')$), in which both child branches of the problematic duplication event survive. We remark that previous work also noted that complementary gene loss can be resolved successfully under certain circumstances [6] such as this one. The key property of good quartets for our purpose is a consequence of [13, Cor. 5]:

Proposition 3.11. If $\langle zxyz' \rangle$ is a good quartet in the BMG $(\vec{G},\sigma)$, then $S^{\cap}(x,y) \neq \emptyset$ and thus, $xy$ is u-fp.

Proof.
Let $\langle zxyz' \rangle$ be a good quartet in $(\vec{G},\sigma)$ and let $(T,\sigma)$ be an arbitrary tree explaining $(\vec{G},\sigma)$. Then [14, Lemma 36] implies that $v := \mathrm{lca}_T(x,y,z,z')$ has two distinct children $v_1, v_2 \in \mathrm{child}(v)$ such that $x, z \preceq_T v_1$ and $y, z' \preceq_T v_2$. Hence, $v = \mathrm{lca}_T(x,y)$. Since $\sigma(z) \in \sigma(L(T(v_1))) \cap \sigma(L(T(v_2)))$, we have $S^{\cap}(x,y) \neq \emptyset$ and, by Prop. 3.6, the edge $xy$ is u-fp.

Prop. 3.11 thus provides a convenient way to identify unambiguous false-positive edges in the BMG.

Lemma 3.12.
If $xy$ is an edge in a BMG $\vec{G}(T,\sigma)$ and $t \in S^{\cap}(x,y)$, then there is a good quartet $\langle z_1 x^* y^* z_2 \rangle$ such that
(a) $\sigma(x^*) = \sigma(x)$, $\sigma(y^*) = \sigma(y)$, and $\sigma(z_1) = \sigma(z_2) = t$;
(b) $x^*, z_1 \in L(T(v_x))$ and $y^*, z_2 \in L(T(v_y))$ with $v_x$ and $v_y$ being the unique children in $\mathrm{child}_T(\mathrm{lca}_T(x,y))$ such that $x \preceq_T v_x$ and $y \preceq_T v_y$.

Figure 4: Example for a $(T,\sigma)$-fp edge $xy$ in $(\vec{G},\sigma)$ which is not the middle edge of a good quartet, but the first edge in an ugly quartet (right). Note that $(\vec{G},\sigma)$ does not contain bad quartets.

Proof.
Consider an edge $xy$ of $\vec{G}(T,\sigma)$ and a color $t \in S^{\cap}(x,y)$. By Cor. 3.4, $t \neq \sigma(x), \sigma(y)$. Lemma 2.3 ensures the existence of an edge $x^* z_1$ in $\vec{G}$ for some leaves $x^* \in L(T(v_x)) \cap L[\sigma(x)]$ and $z_1 \in L(T(v_x)) \cap L[t]$. By the same arguments as in the proof of Cor. 3.4, we can conclude that $z_1 y'$ is not an edge in $\vec{G}$ for all $y' \in L(T(v_y)) \cap L[\sigma(y)]$. However, $(z_1,y') \in E(\vec{G})$ since the color of $y'$ is not present in $T(v_x)$. Likewise, there are leaves $y^* \in L(T(v_y)) \cap L[\sigma(y)]$ and $z_2 \in L(T(v_y)) \cap L[t]$ such that $y^* z_2$ forms an edge in $\vec{G}$. Repeating the arguments used for $L(T(v_x))$, we find that $x' z_2$ is not an edge in $\vec{G}$ and $(z_2,x') \in E(\vec{G})$ for any $x' \in L(T(v_x)) \cap L[\sigma(x)]$. Finally, $\sigma(x) \notin \sigma(L(T(v_y)))$ and $\sigma(y) \notin \sigma(L(T(v_x)))$ imply that $x^* y^*$ forms an edge in $\vec{G}$. Hence, $\langle z_1 x^* y^* z_2 \rangle$ is a good quartet.

Note that the edge $x^* y^*$ in Lemma 3.12 is the middle edge of a good quartet. For completeness, we provide a result for the identification of u-fp edges using bad quartets.

Proposition 3.13.
Let $\langle zxyz' \rangle$ be a bad quartet in a BMG $(\vec{G},\sigma)$. Then, the edges $xz$ and $yz'$ are u-fp and every tree that explains $(\vec{G},\sigma)$ is non-binary.

Proof. Let $(T,\sigma)$ be an arbitrary tree that explains $(\vec{G},\sigma)$, set $u := \mathrm{lca}_T(x,z)$ and let $v_x, v_z \in \mathrm{child}_T(u)$ be the two distinct children of $u$ such that $x \preceq_T v_x$ and $z \preceq_T v_z$. By symmetry, it suffices to show that $xz$ is u-fp. Since $\langle zxyz' \rangle$ is a bad quartet, we have $(x,z),(x,z') \in E(\vec{G})$ and thus $\mathrm{lca}_T(x,z') = \mathrm{lca}_T(x,z) = u$. Let $v_{z'} \in \mathrm{child}_T(u)$ be the child of $u$ such that $z' \preceq_T v_{z'}$. Since $\mathrm{lca}_T(x,z') = u$ we have $v_x \neq v_{z'}$. Now, assume for contradiction that $v_z = v_{z'}$, and thus $z' \in L(T(v_z))$. Since $\langle zxyz' \rangle$ is a bad quartet, we have $(z',x) \notin E(\vec{G})$, which implies the existence of a vertex $x'$ with $\sigma(x) = \sigma(x')$ and $\mathrm{lca}_T(x',z') \prec_T \mathrm{lca}_T(x,z') = u$ and therefore, $x' \in L(T(v_z))$. However, this implies that $\mathrm{lca}_T(x',z) \preceq_T v_z \prec_T u = \mathrm{lca}_T(x,z)$, which together with $\sigma(x) = \sigma(x')$ contradicts the fact that $xz$ is an edge in $\vec{G}$. Hence, $v_z \neq v_{z'}$. Therefore, $\sigma(z) = \sigma(z') \in \sigma(L(T(v_z))) \cap \sigma(L(T(v_{z'}))) \neq \emptyset$ for distinct children $v_z, v_{z'} \in \mathrm{child}_T(u)$. By Lemma 3.2, the edge $xz$ is $(T,\sigma)$-fp and since $(T,\sigma)$ was chosen arbitrarily, the edge $xz$ is u-fp. Moreover, we have shown that $v_x$, $v_z$ and $v_{z'}$ must be pairwise distinct and thus, $(T,\sigma)$ is non-binary.

Edges of good and bad quartets can be used to identify u-fp edges. The example in Fig. 4 shows, however, that not all false-positive edges $xy$ with $S^{\cap}(x,y) \neq \emptyset$ are middle edges of good quartets or first edges of bad quartets. The top vertex in the tree in Fig. 4 must be a duplication event since $S^{\cap}(x,y) = \sigma(L(T(v_x))) \cap \sigma(L(T(v_y))) \neq \emptyset$ (cf. Prop. 3.6). The only good quartet is $\langle zx'yz' \rangle$, identifying $x'y$ as false-positive. Moreover, $(\vec{G},\sigma)$ does not contain a bad quartet. The edge $xy$, on the other hand, is the first edge in the ugly quartet $\langle xyx'z \rangle$. Thus, in this example, there is no evidence provided by good or bad quartets to identify the edge $xy$ as u-fp. Therefore, we focus on ugly quartets as an additional source of information to identify u-fp edges. In particular, as it will turn out, u-fp edges in bad quartets are entirely covered by u-fp edges in good quartets, ugly quartets and a more general subgraph construction that is introduced in Section 3.5.

Proposition 3.14. If $\langle xyx'z \rangle$ is an ugly quartet in a BMG $(\vec{G},\sigma)$, then the edges $xy$ and $yx'$ are u-fp.

Proof. Consider an ugly quartet $\langle xyx'z \rangle$. Let $(T,\sigma)$ be an arbitrary tree explaining $(\vec{G},\sigma)$, put $u := \mathrm{lca}_T(x,y)$ and let $v_x, v_y \in \mathrm{child}_T(u)$ be the two distinct children of $u$ such that $x \preceq_T v_x$ and $y \preceq_T v_y$.

Since $x'y$ and $xy$ are edges in $\vec{G}$ we have $\mathrm{lca}_T(x',y) \preceq_T u$. Moreover, Cor. 3.4 implies $\sigma(x') = \sigma(x) \notin \sigma(L(T(v_y)))$ and thus $x' \notin L(T(v_y))$. Therefore, $\mathrm{lca}_T(x',y) = \mathrm{lca}_T(x,y) = u$.

Now consider an arbitrary reconciliation map $\mu$ from $(T,\sigma)$ to some species tree $S$. The existence of $\mu$ is guaranteed by Lemma 2.15.
If $x' \notin L(T(v_x))$, then there is a vertex $v \in \mathrm{child}_T(u)$, $v \neq v_x, v_y$, such that $x' \preceq_T v$ and $\sigma(x) = \sigma(x') \in \sigma(L(T(v_x))) \cap \sigma(L(T(v))) \neq \emptyset$, which by Lemma 2.17 implies $t_\mu(u) = \square$.

Now suppose $x' \in L(T(v_x))$ and recall that $x'z$ is an edge in $\vec{G}$ by assumption. Since $\mathrm{lca}_T(x',z)$ and $\mathrm{lca}_T(x,x')$ are both ancestors of $x'$ they are comparable. If $\mathrm{lca}_T(x',z) \succ_T \mathrm{lca}_T(x,x')$, then $\mathrm{lca}_T(x,z) = \mathrm{lca}_T(x',z)$. Together with the fact that $x'z$ is an edge in $\vec{G}$ but not $xz$, this implies that there is a $z' \in L[\sigma(z)]$ such that $\mathrm{lca}_T(x,z') \prec_T \mathrm{lca}_T(x,z)$. This in turn implies $\mathrm{lca}_T(x',z') \prec_T \mathrm{lca}_T(x',z)$, which contradicts that $x'z$ is an edge in $\vec{G}$. Therefore, $x' \in L(T(v_x))$ implies $\mathrm{lca}_T(x',z) \preceq_T \mathrm{lca}_T(x,x')$ and $x, x', z \in L(T(v_x))$. Since $yz$ is not an edge in $\vec{G}$ by assumption and Cor. 3.4 implies $\sigma(y) \notin \sigma(L(T(v_x)))$, there is a leaf $z'$ with color $\sigma(z') = \sigma(z)$ such that $\mathrm{lca}_T(y,z') \prec_T \mathrm{lca}_T(y,z)$. This is only possible if $z' \in L(T(v_y)) \cap L[\sigma(z)]$. Therefore, $\sigma(z) \in \sigma(L(T(v_x))) \cap \sigma(L(T(v_y)))$ and Lemma 2.17 implies that $t_\mu(u) = \square$.

In summary, $\mathrm{lca}_T(x',y) = \mathrm{lca}_T(x,y) = u$ and $t_\mu(u) = \square$ for every tree explaining $(\vec{G},\sigma)$ and every possible reconciliation map $\mu$ from $(T,\sigma)$ to any species tree. Thus both $xy$ and $x'y$ are u-fp.

Figure 5: The edge $xy$ is u-fp since it is the first edge of an ugly quartet. However, $S^{\cap}(x,y) = \emptyset$ and thus, the converse of Prop. 3.15 is not satisfied.

Proposition 3.15.
Let $(\vec{G},\sigma)$ be a BMG and $xy$ an edge in $\vec{G}$ with $S^{\cap}(x,y) \neq \emptyset$. Then $xy$ is either the middle edge of some good quartet $\langle zxyz' \rangle$ or the first edge in some ugly quartet $\langle xyx'z \rangle$ or $\langle yxy'z \rangle$.

Proof. Let $(T,\sigma)$ be a leaf-colored tree explaining the BMG $(\vec{G},\sigma)$ with symmetric part $(G,\sigma)$. Let $v_x, v_y \in \mathrm{child}_T(\mathrm{lca}_T(x,y))$ such that $x \preceq_T v_x$ and $y \preceq_T v_y$. Since $S^{\cap}(x,y) \neq \emptyset$, Lemma 3.12 implies that there is a good quartet $\langle z_1 x^* y^* z_2 \rangle$ with $\sigma(x^*) = \sigma(x)$, $\sigma(y^*) = \sigma(y)$, $\sigma(z_1) = \sigma(z_2) = t \in S^{\cap}(x,y)$, $x^*, z_1 \in L(T(v_x))$ and $y^*, z_2 \in L(T(v_y))$.

If $x = x^*$ and $y = y^*$ we are done. By symmetry it suffices to consider the case $x \neq x^*$. Before we proceed, we consider the (non-)existence of certain edges in the RBMG $G(T,\sigma)$ and the BMG $\vec{G}(T,\sigma)$. By definition of good quartets, we have $x^* z_1, x^* y^*, y^* z_2 \in E(G)$ and Cor. 3.4 implies $\sigma(x), \sigma(y) \notin S^{\cap}(x,y)$. Hence, $\sigma(x^*) = \sigma(x) \notin \sigma(L(T(v_y)))$ and $\sigma(y^*) = \sigma(y) \notin \sigma(L(T(v_x)))$, and thus $x^* y \in E(G)$ and $x y^* \in E(G)$. Moreover, since $\mathrm{lca}_T(y,z_2) \prec_T \mathrm{lca}_T(y,z_1)$, we have $y z_1 \notin E(G)$. Similarly, $x z_2 \notin E(G)$. However, $\sigma(x) \notin \sigma(L(T(v_y)))$ implies that $\mathrm{lca}_T(z_2,x) = \mathrm{lca}_T(x,y) \preceq_T \mathrm{lca}_T(z_2,x')$ for all $x' \in L[\sigma(x)]$ and thus, $(z_2,x) \in E(\vec{G})$. Similarly, $(z_1,y) \in E(\vec{G})$. Furthermore, we note that neither $x$ and $x^*$ nor $y$ and $y^*$ can be adjacent in $G$ or $\vec{G}$ since $\sigma(x) = \sigma(x^*)$ and $\sigma(y) = \sigma(y^*)$.

If $x z_1 \notin E(G)$, then $\langle x y x^* z_1 \rangle$ forms an ugly quartet. Now suppose that $x z_1 \in E(G)$. Assume that there is an edge $yz' \in E(G)$ with $z' \in L(T(v_y)) \cap L[t]$. Then, $\mathrm{lca}_T(x,z_1) \prec_T \mathrm{lca}_T(x,z')$ implies $xz' \notin E(G)$. Moreover, since $\sigma(x) \notin \sigma(L(T(v_y)))$ we have, by similar arguments as above, that $(z',x) \in E(\vec{G})$. Thus, $\langle z' y x z_1 \rangle$ forms a good quartet. Finally, if there is no such edge $yz' \in E(G)$ then, in particular, $y z_2 \notin E(G)$ and $y \neq y^*$. In this case, $\langle y x y^* z_2 \rangle$ forms an ugly quartet.

The example in Fig. 5 shows that the converse of Prop. 3.15 is not true in general.

We summarize the results of Prop. 3.6, 3.11, 3.14 and 3.15 in the following

Observation 3.16.
Let $(\vec{G},\sigma)$ be a BMG that contains the edge $xy$. Then, $S^{\cap}(x,y) \neq \emptyset$ implies that $xy$ is either the middle edge of some good quartet or the first edge of some ugly quartet, which in turn implies that $xy$ is u-fp.

All u-fp edges $xy$ with $S^{\cap}(x,y) \neq \emptyset$ in $(\vec{G},\sigma)$ are therefore completely determined by the middle edges of good quartets and the first edges of ugly quartets. Furthermore, if $xy$ is the middle edge of a good quartet, then $S^{\cap}(x,y) \neq \emptyset$. Therefore, only ugly quartets provide more information about u-fp edges than $S^{\cap}(x,y) \neq \emptyset$, as shown in Fig. 5. On the other hand, ugly quartets do not convey complete information either. The edge $xy$ in the BMG illustrated in Fig. 6(A) is u-fp, but it is not contained in a good, bad, or ugly quartet.

$S^{\cap}(x,y) = \emptyset$ and Hourglasses

In this section we turn to the case $S^{\cap}(x,y) = \emptyset$ and ask how unambiguous false-positive edges that are associated with a possibly non-binary duplication node of $(T,\sigma)$ can be identified. To this end, we consider a motif in $\vec{G}(T,\sigma)$ that is not necessarily part of an induced $P_4$.

Definition 3.17 (Hourglass). An hourglass in a proper vertex-colored graph $(\vec{G},\sigma)$, denoted by $[xy \mathrel{\searrow\nearrow} x'y']$, is a subgraph $(\vec{G}[Q],\sigma_{|Q})$ induced by a set of four pairwise distinct vertices $Q = \{x,x',y,y'\} \subseteq V(\vec{G})$ such that
(i) $\sigma(x) = \sigma(x') \neq \sigma(y) = \sigma(y')$,
(ii) $xy$ and $x'y'$ are edges in $\vec{G}$,
(iii) $(x,y'),(y,x') \in E(\vec{G})$, and
(iv) $(y',x),(x',y) \notin E(\vec{G})$.

Note that Condition (i) rules out arcs between $x, x'$ and $y, y'$, respectively, i.e., the only arcs in an hourglass are the ones specified by Conditions (ii) and (iii). An example is shown in Fig. 6(A).

Observation 3.18.
Every hourglass is a BMG since it can be explained by a tree as shown in Fig. 6(B).
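Conditions (i)-(iv) of Def. 3.17 can be tested directly on a small colored digraph. The following Python sketch is only an illustration: it assumes that the digraph is given as a set of arcs `(a, b)` and the coloring as a dictionary, and that the graph is properly colored (no arcs between vertices of the same color). The function names `is_edge`, `is_hourglass`, and `hourglasses` are hypothetical and do not come from the paper.

```python
from itertools import permutations


def is_edge(arcs, a, b):
    """True if a and b are joined by arcs in both directions (symmetric part)."""
    return (a, b) in arcs and (b, a) in arcs


def is_hourglass(arcs, color, x, y, xp, yp):
    """Check Conditions (i)-(iv) of Def. 3.17 for the ordered tuple (x, y, x', y').

    Assumes a properly colored digraph, so arcs within {x, x'} or {y, y'}
    cannot occur and need not be checked.
    """
    if len({x, y, xp, yp}) != 4:
        return False
    return (
        color[x] == color[xp] and color[y] == color[yp]
        and color[x] != color[y]                             # (i)
        and is_edge(arcs, x, y) and is_edge(arcs, xp, yp)    # (ii)
        and (x, yp) in arcs and (y, xp) in arcs              # (iii)
        and (yp, x) not in arcs and (xp, y) not in arcs      # (iv)
    )


def hourglasses(arcs, color):
    """Brute-force listing of all ordered tuples (x, y, x', y') forming an hourglass."""
    return [q for q in permutations(sorted(color), 4)
            if is_hourglass(arcs, color, *q)]


# Toy instance: exactly the arcs required by Conditions (ii) and (iii).
arcs = {("x", "y"), ("y", "x"), ("x2", "y2"), ("y2", "x2"),
        ("x", "y2"), ("y", "x2")}
color = {"x": "r", "x2": "r", "y": "b", "y2": "b"}
found = hourglasses(arcs, color)
```

The brute-force enumeration inspects all ordered 4-tuples and is meant for small examples only. Note that each hourglass is reported twice, since exchanging the roles of the two colors yields the same induced subgraph.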
First, we note that hourglasses cannot appear in a BMG that can be explained by a binary tree.
Lemma 3.19. If $(\vec{G},\sigma)$ is a BMG containing the hourglass $[xy \searrow\!\!\!\!\nearrow x'y']$, then every tree $(T,\sigma)$ that explains $(\vec{G},\sigma)$ contains a vertex $u \in V(T)$ with three distinct children $v_1$, $v_2$, and $v_3$ such that $x \preceq_T v_1$, $\operatorname{lca}_T(x',y') \preceq_T v_2$ and $y \preceq_T v_3$.
Proof. By assumption, $xy$ and $x'y'$ are edges in $\vec{G}$, $(x,y'),(y,x') \in E(\vec{G})$, and $(y',x),(x',y) \notin E(\vec{G})$. By Lemma 2.8, the informative triples $x'y'|x$ and $x'y'|y$ thus must be displayed by every tree $(T,\sigma)$ that explains $(\vec{G},\sigma)$. Thus $u_{x'y'} := \operatorname{lca}_T(x',y') \prec_T u_x := \operatorname{lca}_T(x,u_{x'y'})$ and $u_{x'y'} \prec_T u_y := \operatorname{lca}_T(y,u_{x'y'})$. Furthermore, $u_x$ and $u_y$ are both ancestors of $u_{x'y'}$ and thus comparable w.r.t. $\preceq_T$. If $u_x \prec_T u_y$, then $\operatorname{lca}_T(x,y') \prec_T \operatorname{lca}_T(x,y)$, which implies that $xy$ cannot form an edge in $\vec{G}$; a contradiction. By similar arguments, $u_y \prec_T u_x$ is not possible and therefore, $u_x = u_y =: u$.
Since $u_{x'y'} \prec_T u$, there are two distinct children $v_1,v_2 \in \operatorname{child}_T(u)$ of $u$ such that $x \preceq_T v_1$ and $u_{x'y'} \preceq_T v_2$. Clearly, $y \notin L(T(v_2))$ since $\operatorname{lca}_T(y,u_{x'y'}) = u \succ_T v_2$. We also have $y \notin L(T(v_1))$ since $y \in L(T(v_1))$ would imply $\operatorname{lca}_T(x,y) \preceq_T v_1 \prec_T u = \operatorname{lca}_T(x,u_{x'y'}) = \operatorname{lca}_T(x,y')$, contradicting $(x,y') \in E(\vec{G})$.
Together with $y \in L(T(u))$, this implies the existence of a vertex $v_3 \in \operatorname{child}(u)$ such that $v_3 \notin \{v_1,v_2\}$ and $y \preceq_T v_3$.
The result shows that hourglasses $[xy \searrow\!\!\!\!\nearrow x'y']$ can be used to identify false-positive edges $xy$ with $S^{\cap}(x,y) = \emptyset$.
Proposition 3.20.
If a BMG $(\vec{G},\sigma)$ contains an hourglass $[xy \searrow\!\!\!\!\nearrow x'y']$, then the edge $xy$ is u-fp.
Proof. According to Lemma 3.19, every tree $(T,\sigma)$ that explains $(\vec{G},\sigma)$ contains a vertex $u \in V(T)$ with three distinct children $v_1$, $v_2$, and $v_3$ such that $x \preceq_T v_1$, $\operatorname{lca}_T(x',y') \preceq_T v_2$ and $y \preceq_T v_3$. Thus, $u = \operatorname{lca}_T(x,y)$ and $\sigma(x) \in \sigma(L(T(v_1))) \cap \sigma(L(T(v_2)))$. Hence, we can apply Lemma 3.2 to conclude that $xy$ is $(T,\sigma)$-fp for every tree that explains $(\vec{G},\sigma)$. Therefore, the edge $xy$ is u-fp.
Prop. 3.20 implies that there are u-fp edges that are not contained in a quartet, since an hourglass (see Fig. 6(A)) does not contain a $P_4$. We next generalize the concept of hourglasses.
Definition 3.21 (Hourglass chain). An hourglass chain $H$ in a graph $(\vec{G},\sigma)$ is a sequence of $k \geq 1$ hourglasses $[x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ such that the following two conditions are satisfied for all $i \in \{1,\dots,k-1\}$:
(H1) $y_i = x'_{i+1}$ and $y'_i = x_{i+1}$, and
(H2) $x_iy'_j$ is an edge in $\vec{G}$ for all $j \in \{i+1,\dots,k\}$.
A vertex $z$ is called a left (resp., right) tail of the hourglass chain $H$ if it holds that $(z,x_1) \in E(\vec{G})$ and $(z,x'_1) \notin E(\vec{G})$ (resp., $(z,y_k) \in E(\vec{G})$ and $(z,y'_k) \notin E(\vec{G})$). We call $H$ tailed if it has a left or right tail.
Note that in contrast to the quartets and the hourglass, an hourglass chain in $(\vec{G},\sigma)$ is not necessarily an induced subgraph.
Observation 3.22.
If $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ is an hourglass chain in $(\vec{G},\sigma)$, then $[x_iy_i \searrow\!\!\!\!\nearrow x'_iy'_i],\dots,[x_jy_j \searrow\!\!\!\!\nearrow x'_jy'_j]$ is an hourglass chain in $(\vec{G},\sigma)$ for every $1 \leq i < j \leq k$.
Hourglass chains are "overlapping" hourglasses. The additional condition that $x_iy'_j \in E(G)$ for all $1 \leq i < j \leq k$ ensures that the two pairs $x'_k,y'_k$ and $x'_l,y'_l$ with $k \neq l$ cannot lie in the same subtree below the last common ancestor $u$, which is common to all hourglasses in the chain.
Lemma 3.23.
Let $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ be an hourglass chain in a BMG $(\vec{G},\sigma)$. Then, for every tree $(T,\sigma)$ that explains $(\vec{G},\sigma)$ there is a vertex $u \in V(T)$ with pairwise distinct children $v_0,v_1,\dots,v_k,v_{k+1}$ such that $x_1 \in L(T(v_0))$, $y_k \in L(T(v_{k+1}))$, and, for all $1 \leq i \leq k$, we have $x'_i,y'_i \in L(T(v_i))$.
Proof. We prove the statement by induction on $k$. For the base case $k = 1$, observe that the hourglass $[x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1]$ together with Lemma 3.19 implies that there is a vertex $u \in V(T)$ with pairwise distinct children $v_0$, $v_1$ and $v_2$ such that $x_1 \preceq_T v_0$, $\operatorname{lca}_T(x'_1,y'_1) \preceq_T v_1$ (thus $x'_1,y'_1 \preceq_T v_1$) and $y_1 \preceq_T v_2$.
Now let $k > 1$ and assume that the statement holds for all hourglass chains consisting of fewer than $k$ hourglasses. Let $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ be an hourglass chain. By induction hypothesis, for every subsequence $H_{i|} := [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_iy_i \searrow\!\!\!\!\nearrow x'_iy'_i]$ of $H$ with $1 \leq i < k$, which by Observation 3.22 is again an hourglass chain, the statement is true.
Figure 6: A: Hourglass. B: Visualization of Lemma 3.19. C: Hourglass chain with left tail $z$ and right tail $z'$ for an odd number of hourglasses in the chain. Edges of the form $x_iy'_j \in E(G)$ are only shown for $x_1$; the others are omitted. An hourglass chain $H$ is a subgraph but not necessarily induced, and thus additional arcs may exist. In particular, the elements $e \in \{x_1y_k, zy_k, x_1z', zz'\}$ are not necessarily edges in an hourglass chain. However, whenever they exist, they are u-fp (cf. Lemma 3.25). Moreover, each single hourglass in $H$ is an induced subgraph of the BMG; by definition, therefore, there are no arcs $(z,x'_1)$ or $(z',y'_k)$. Note, $\sigma(z) \neq \sigma(z')$ is possible. D: Visualization of Lemmas 3.23 and 3.24.
Consider the subsequence $H_{i|}$ with $i = k-1$. By assumption, there is a vertex $u \in V(T)$ with pairwise distinct children $v_0,v_1,\dots,v_i,v_{i+1}$ such that $x_1 \in L(T(v_0))$, $y_i \in L(T(v_{i+1}))$, and, for all $1 \leq j \leq i$, we have $x'_j,y'_j \in L(T(v_j))$. The hourglass $[x_{i+1}y_{i+1} \searrow\!\!\!\!\nearrow x'_{i+1}y'_{i+1}]$ and Lemma 3.19 imply the existence of a vertex $u' \in V(T)$ with pairwise distinct children $v'_i$, $v'_{i+1}$ and $v'_{i+2}$ such that $x_{i+1} \preceq_T v'_i$, $\operatorname{lca}_T(x'_{i+1},y'_{i+1}) \preceq_T v'_{i+1}$ and $y_{i+1} \preceq_T v'_{i+2}$. By the definition of hourglass chains, we have $y_i = x'_{i+1}$ and $y'_i = x_{i+1}$. Therefore, $u' = \operatorname{lca}_T(x'_{i+1},x_{i+1}) = \operatorname{lca}_T(y_i,y'_i) = u$. Since $v_i$ and $v'_i$ are both children of $u$, $y'_i = x_{i+1}$, and it holds both that $y'_i \preceq_T v_i$ and $x_{i+1} \preceq_T v'_i$, we conclude that $v_i = v'_i$. Similarly, it holds $v_{i+1} = v'_{i+1}$ since $v_{i+1},v'_{i+1} \in \operatorname{child}(u)$ and $y_i = x'_{i+1}$. In particular, we have $v'_{i+2} \neq v'_{i+1} = v_{i+1}$ and $v'_{i+2} \neq v'_i = v_i$. It remains to show that $v'_{i+2} \neq v_j$ for $0 \leq j < i$. Assume, for contradiction, that $v'_{i+2} = v_j$ for some fixed $j$ with $0 \leq j < i$. By assumption, $x_1 \preceq_T v_j$ if $j = 0$, and otherwise, $x_{j+1} = y'_j \preceq_T v_j$. Moreover, since $v'_{i+2} = v_j$, we have $y_{i+1} \preceq_T v_j$. Hence, $\operatorname{lca}_T(x_{j+1},y_{i+1}) \preceq_T v_j$. Furthermore, since $y'_{i+1} \preceq_T v_{i+1} \neq v_j$, it holds $\operatorname{lca}_T(x_{j+1},y'_{i+1}) = u \succ_T v_j$. Since $\sigma(y_{i+1}) = \sigma(y'_{i+1})$ by the definition of hourglasses, the latter two arguments contradict $x_{j+1}y'_{i+1} \in E(G)$, which must hold by the definition of hourglass chains. Hence, we can conclude that $v'_{i+2} \neq v_j$ for all $0 \leq j < i$ and we set $v_{i+2} := v'_{i+2}$. In summary, the statement holds for the hourglass chain $H_{i+1|} = H$.
It is straightforward to generalize the latter statement to tailed hourglass chains.
Lemma 3.24.
Let $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ be an hourglass chain with left (resp. right) tail $z$ in a BMG $(\vec{G},\sigma)$. Then, every tree $(T,\sigma)$ that explains $(\vec{G},\sigma)$ contains a vertex $u \in V(T)$ with pairwise distinct children $v_0,v_1,\dots,v_k,v_{k+1}$ such that $x_1 \in L(T(v_0))$, $y_k \in L(T(v_{k+1}))$, and, for all $1 \leq i \leq k$, we have $x'_i,y'_i \in L(T(v_i))$. Furthermore, we have $z \preceq_T v_0$ (resp. $z \preceq_T v_{k+1}$).
Proof. By Lemma 3.23, there is a vertex $u \in V(T)$ with pairwise distinct children $v_0,v_1,\dots,v_k,v_{k+1}$ such that $x_1 \in L(T(v_0))$, $y_k \in L(T(v_{k+1}))$, and, for all $1 \leq i \leq k$, we have $x'_i,y'_i \in L(T(v_i))$.
Suppose that $z$ is a left tail of $H$. We need to show that $z \preceq_T v_0$. By definition, $(z,x_1) \in E(\vec{G})$, $(z,x'_1) \notin E(\vec{G})$, and $\sigma(x_1) = \sigma(x'_1)$. Therefore, $zx_1|x'_1$ is an informative triple for $T$, and hence $\operatorname{lca}_T(z,x_1) \prec_T \operatorname{lca}_T(z,x'_1) = \operatorname{lca}_T(x_1,x'_1) = u$. Since $v_0$ is the unique child of $u$ with $x_1 \prec_T v_0$, we can conclude that $\operatorname{lca}_T(z,x_1) \preceq_T v_0$ and thus, $z \preceq_T v_0$.
If $z$ is a right tail of $H$, a similar argument using the informative triple $zy_k|y'_k$, which must be displayed by $T$ because $(z,y_k) \in E(\vec{G})$ and $(z,y'_k) \notin E(\vec{G})$, implies $z \preceq_T v_{k+1}$.
We are now in the position to show that hourglass chains identify additional u-fp edges that are not contained in a single hourglass.
Lemma 3.25.
Let $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ be an hourglass chain in $(\vec{G},\sigma)$, possibly with a left tail $z$ or a right tail $z'$. Then every edge $e \in \{x_1y_k, zy_k, x_1z', zz'\} \cap E(G)$ is u-fp, where $G$ denotes the symmetric part of $\vec{G}$.
Proof.
Let $(T,\sigma)$ be an arbitrary tree that explains $(\vec{G},\sigma)$. By the definition of hourglass chains, we have $k \geq 1$, so $H$ contains at least the hourglass $[x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1]$. Since $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ is an hourglass chain in $\vec{G}(T,\sigma)$, Lemma 3.24 implies the existence of a vertex $u \in V(T)$ with pairwise distinct children $v_0,v_1,\dots,v_k,v_{k+1}$ such that $x_1 \in L(T(v_0))$, $y_k \in L(T(v_{k+1}))$, and, for all $1 \leq i \leq k$, we have $x'_i,y'_i \in L(T(v_i))$. Furthermore, this lemma also implies $z \preceq_T v_0$ if $z$ is a left tail of $H$, and $z' \preceq_T v_{k+1}$ if $z'$ is a right tail of $H$. Note that $\operatorname{lca}_T(x_1,x'_1) = u$, and $x_1$ and $x'_1$ lie below distinct children of $u$. More precisely, $x_1 \preceq_T v_0$ and $x'_1 \preceq_T v_1$. Since $\sigma(x_1) = \sigma(x'_1)$, we have $\sigma(L(T(v_0))) \cap \sigma(L(T(v_1))) \neq \emptyset$. Moreover, $\operatorname{lca}_T(a,b) = u$ for every edge $e = ab$ in $\vec{G}$ that coincides with one of $x_1y_k$, $zy_k$, $x_1z'$, and $zz'$. The latter two arguments together with Lemma 3.2 imply that every such edge is $(T,\sigma)$-fp. Since $(T,\sigma)$ was chosen arbitrarily, every such edge is also u-fp.
It is important to note that the construction of hourglass chains does not imply that an edge $e \in \{x_1y_k, zy_k, x_1z', zz'\}$ must exist in $(\vec{G},\sigma)$. Nevertheless, whenever such an edge occurs, it is u-fp. We will take a closer look at the properties of hourglass chains in Section 5. Finding hourglass chains in $(\vec{G},\sigma)$ is closely related to the NP-complete SUBGRAPH ISOMORPHISM problem [11], and hence a difficult endeavor in practice. In the following section we shall see, however, that the identification of u-fp edges does not require the explicit enumeration of hourglass chains.
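Although finding hourglass chains is hard, checking whether a given candidate sequence satisfies Def. 3.21 is straightforward. The following Python sketch illustrates such a check together with the tail conditions. It is an illustration under assumptions, not the authors' implementation: the digraph is assumed to be a set of arcs, each hourglass is assumed to be given as an ordered tuple `(x_i, y_i, x'_i, y'_i)`, and all function names are hypothetical. The toy BMG was computed by hand from a tree whose root has four children with leaf sets {a, g}, {b, c}, {d, e}, and {f}.

```python
def is_edge(arcs, a, b):
    """True if a and b are joined by arcs in both directions."""
    return (a, b) in arcs and (b, a) in arcs


def is_hourglass(arcs, color, x, y, xp, yp):
    """Conditions (i)-(iv) of Def. 3.17 for (x, y, x', y'); assumes proper coloring."""
    return (
        len({x, y, xp, yp}) == 4
        and color[x] == color[xp] != color[y] == color[yp]
        and is_edge(arcs, x, y) and is_edge(arcs, xp, yp)
        and (x, yp) in arcs and (y, xp) in arcs
        and (yp, x) not in arcs and (xp, y) not in arcs
    )


def is_hourglass_chain(arcs, color, chain):
    """Check Def. 3.21 for a list of tuples (x_i, y_i, x'_i, y'_i), i = 1..k."""
    if not chain or not all(is_hourglass(arcs, color, *h) for h in chain):
        return False
    k = len(chain)
    for i in range(k - 1):
        x_i, y_i, _, yp_i = chain[i]
        x_next, _, xp_next, _ = chain[i + 1]
        if y_i != xp_next or yp_i != x_next:                 # (H1)
            return False
        if not all(is_edge(arcs, x_i, chain[j][3])           # (H2): x_i y'_j
                   for j in range(i + 1, k)):
            return False
    return True


def is_left_tail(arcs, chain, z):
    """(z, x_1) is an arc and (z, x'_1) is not."""
    x1, _, xp1, _ = chain[0]
    return (z, x1) in arcs and (z, xp1) not in arcs


def is_right_tail(arcs, chain, z):
    """(z, y_k) is an arc and (z, y'_k) is not."""
    _, yk, _, ypk = chain[-1]
    return (z, yk) in arcs and (z, ypk) not in arcs


# Hand-computed toy BMG (an assumption of this example, see lead-in).
out = {"a": "cdefg", "b": "cefg", "c": "befg", "d": "abeg",
       "e": "abdg", "f": "abcdg", "g": "acdef"}
arcs = {(u, v) for u, targets in out.items() for v in targets}
color = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "C", "f": "C", "g": "D"}
chain = [("a", "d", "b", "c"), ("c", "f", "d", "e")]
```

In this example, `chain` is an hourglass chain with left tail `g`. The edges `af` (playing the role of $x_1y_k$) and `gf` (playing the role of $zy_k$) are present in the toy graph and are therefore u-fp by Lemma 3.25.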
So far we focused on the set $S^{\cap}(x,y)$ for individual edges $xy$ and induced subgraphs of BMGs to identify u-fp edges. This has led to several sufficient conditions. We now shift our point of view and consider the color allocation to the subtrees below each vertex of a tree explaining a given BMG. This leads us to the idea of a color-set intersection graph.
Definition 4.1.
The color-set intersection graph $\mathfrak{C}_T(u)$ of an inner vertex $u$ of a leaf-colored gene tree $(T,\sigma)$ is the undirected graph with vertex set $V := \operatorname{child}_T(u)$ and edge set
$E := \{ v_1v_2 \mid v_1,v_2 \in V,\ v_1 \neq v_2 \text{ and } \sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset \}$.
This construction is similar to the definition of intersection graphs used, e.g., in [34]. $\mathfrak{C}_T(u)$ can be viewed as a natural generalization of $S^{\cap}(x,y)$ in the following sense: if $u = \operatorname{lca}_T(x,y)$ is a binary vertex, then $\mathfrak{C}_T(u) = K_2$ iff $S^{\cap}(x,y) \neq \emptyset$ and therefore, $\mathfrak{C}_T(u) = K_1 \cup K_1$ iff $S^{\cap}(x,y) = \emptyset$. In the non-binary case, there is an edge $v_1v_2$ iff $S^{\cap}(x,y) \neq \emptyset$ for some $x \in L(T(v_1))$ and $y \in L(T(v_2))$. Shortest paths in the color-set intersection graphs will play an important role in identifying many u-fp edges.
Lemma 4.2.
Let $v_1$ and $v_k$ be two distinct vertices in the same connected component of the color-set intersection graph $\mathfrak{C}_T(u)$ of a leaf-colored gene tree $(T,\sigma)$, and let $P(v_1,v_k) = (v_1,\dots,v_k)$ be a shortest path in $\mathfrak{C}_T(u)$ connecting $v_1$ and $v_k$. Then $\sigma(L(T(v_i))) \cap \sigma(L(T(v_j))) = \emptyset$ for all $i$ and $j$ satisfying $1 \leq i < i+2 \leq j \leq k$.
Proof. Assume, for contradiction, that $\sigma(L(T(v_i))) \cap \sigma(L(T(v_j))) \neq \emptyset$ for some $i,j$ with $1 \leq i < i+2 \leq j \leq k$. Then the edge $v_iv_j$ must be contained in $\mathfrak{C}_T(u)$, contradicting the fact that $P(v_1,v_k)$ is a shortest path.
Lemma 4.3.
Let $(\vec{G},\sigma)$ be a BMG that is explained by $(T,\sigma)$ and suppose that $x,y \in L(T)$ are two distinct leaves with $u := \operatorname{lca}_T(x,y)$ and $v_x,v_y \in \operatorname{child}_T(u)$ such that (i) $x \preceq_T v_x$ and $y \preceq_T v_y$, and (ii) there is a shortest path $(v_x = v_0,v_1,\dots,v_k,v_{k+1} = v_y)$ of length at least two in $\mathfrak{C}_T(u)$. Then there is an hourglass chain $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ in $(\vec{G},\sigma)$. In particular, precisely one of the following conditions is satisfied:
1. $x = x_1$ and $y_k = y$;
2. $y_k = y$ and $z := x$ is a left tail of $H$;
3. $x = x_1$ and $z' := y$ is a right tail of $H$; or
4. $z := x$ is a left tail and $z' := y$ is a right tail of $H$.
Proof. Lemma 4.2 implies $S^{\cap}(x,y) = \sigma(L(T(v_x))) \cap \sigma(L(T(v_y))) = \sigma(L(T(v_0))) \cap \sigma(L(T(v_{k+1}))) = \emptyset$. We proceed by showing that the BMG $\vec{G}(T,\sigma)$ contains an hourglass chain $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$, possibly with left tail $z$ and right tail $z'$, such that one of the Conditions 1–4 is satisfied.
We first consider the two cases: either (A) $\sigma(x) \in \sigma(L(T(v_1)))$ or (B) $\sigma(x) \notin \sigma(L(T(v_1)))$. In Case (A), we set $x_1 := x$ and $c_0 := \sigma(x)$. In Case (B), we set $z := x$, choose $c_0 \in \sigma(L(T(v_0))) \cap \sigma(L(T(v_1)))$ arbitrarily (note that $v_0v_1$ forms an edge in $\mathfrak{C}_T(u)$ and thus, the latter intersection is non-empty) and we set $x_1 = v$ for some $v \in L(T(v_0)) \cap L[c_0]$ such that $\operatorname{lca}_T(v,x) \preceq_T \operatorname{lca}_T(v',x) \preceq_T v_0$ for all $v' \in L(T(v_0)) \cap L[c_0]$. Clearly, such a vertex $v$ exists. Moreover, $c_0 \neq \sigma(x)$ and we obtain $(x,v) = (z,x_1) \in E(\vec{G})$ as a necessary requirement for left tails. In summary, we have in Case (A) $x = x_1$, and in Case (B) $x$ plays the role of the left tail $z$ and $x_1$ is some other vertex.
Moreover, in both Cases (A) and (B), we have $\sigma(x_1) = c_0 \in \sigma(L(T(v_0))) \cap \sigma(L(T(v_1)))$.
We now consider the "other end" of the hourglass chain, that is, vertex $y_k$ and the possible right tail. Again, we have two cases: either (A') $\sigma(y) \in \sigma(L(T(v_k)))$ or (B') $\sigma(y) \notin \sigma(L(T(v_k)))$. In Case (A'), we set $y_k := y$ and $c_k := \sigma(y)$. In Case (B'), we set $z' := y$, and, by similar arguments as in Cases (A) and (B), we can choose $c_k \in \sigma(L(T(v_k))) \cap \sigma(L(T(v_{k+1})))$ arbitrarily and set $y_k = w$ for some vertex $w \in L(T(v_{k+1})) \cap L[c_k]$ such that $(y,w) = (z',y_k) \in E(\vec{G})$ as a necessary requirement for right tails. Again, for both Cases (A') and (B') we have $\sigma(y_k) = c_k \in \sigma(L(T(v_k))) \cap \sigma(L(T(v_{k+1})))$.
We continue by picking an arbitrary color $c_i$ from $\sigma(L(T(v_i))) \cap \sigma(L(T(v_{i+1})))$ for each $1 \leq i < k$. This is possible because $v_iv_{i+1} \in E(\mathfrak{C}_T(u))$, and thus $\sigma(L(T(v_i))) \cap \sigma(L(T(v_{i+1}))) \neq \emptyset$. Note that now $c_i \in \sigma(L(T(v_i))) \cap \sigma(L(T(v_{i+1})))$ holds for all $0 \leq i \leq k$. In particular, the colors $c_0,c_1,\dots,c_k$ are pairwise distinct. To see this, assume, for contradiction, that $c_i = c_j$ for some $i,j$ with $i < j$. Then $c_i \in \sigma(L(T(v_i)))$ and $c_i = c_j \in \sigma(L(T(v_{j+1})))$, which implies $c_i \in \sigma(L(T(v_i))) \cap \sigma(L(T(v_{j+1})))$. This contradicts Lemma 4.2 since $j+1 \geq i+2$.
For each $1 \leq i \leq k$, we have $c_{i-1},c_i \in \sigma(L(T(v_i)))$. Thus Lemma 2.3 ensures the existence of vertices $x'_i \in L(T(v_i)) \cap L[c_{i-1}]$ and $y'_i \in L(T(v_i)) \cap L[c_i]$ that form an edge $x'_iy'_i$ in $\vec{G}$. By assumption we have $x'_iy'_i \in E(G)$ for all $1 \leq i \leq k$ since $[x_iy_i \searrow\!\!\!\!\nearrow x'_iy'_i]$ is an hourglass. We already set $x_1$ and $y_k$.
We furthermore set $x_i := y'_{i-1}$ for all $1 < i \leq k$, and $y_i := x'_{i+1}$ for all $1 \leq i < k$. This ensures that (H1) in Def. 3.21 is satisfied. Moreover, since $\sigma(x_1) = c_0 = \sigma(x'_1)$ and $\sigma(x_i) = \sigma(y'_{i-1}) = c_{i-1}$ for all $1 < i \leq k$, we have $\sigma(x_i) = c_{i-1} = \sigma(x'_i)$ for all $1 \leq i \leq k$. Similar arguments imply $\sigma(y_i) = c_i = \sigma(y'_i)$ for all $1 \leq i \leq k$.
We next show that the induced subgraph $\vec{G}[x_i,x'_i,y_i,y'_i]$ is an hourglass for $1 \leq i \leq k$ and that $x_iy'_j$ is an edge in $\vec{G}$ for all $i < j \leq k$. We also know, by construction, that $x'_iy'_i$ is an edge in $\vec{G}$.
Independent of whether $x_1$ was constructed based on Case (A) or (B), we have $x_i \preceq_T v_0$ if $i = 1$ and $x_i = y'_{i-1} \preceq_T v_{i-1}$ otherwise. Thus $x_i \preceq_T v_{i-1}$. Likewise, independent of whether $y_k$ was constructed based on Case (A') or (B'), we have $y_i \preceq_T v_{k+1}$ if $i = k$ and $y_i = x'_{i+1} \preceq_T v_{i+1}$ otherwise. Thus $y_i \preceq_T v_{i+1}$. In summary, we have $x_i \preceq_T v_{i-1}$; $x'_i,y'_i \preceq_T v_i$; and $y_i \preceq_T v_{i+1}$ for all $i \in \{1,\dots,k\}$. This implies $\operatorname{lca}_T(x_i,y'_i) = \operatorname{lca}_T(x_i,y_i) = \operatorname{lca}_T(x'_i,y_i) = u$. Since $i+1 \geq (i-1)+2$ and $P(v_0,v_{k+1})$ is a shortest path, Lemma 4.2 implies $\sigma(L(T(v_{i-1}))) \cap \sigma(L(T(v_{i+1}))) = \emptyset$. From $\sigma(x_i) \in \sigma(L(T(v_{i-1})))$ and $\sigma(y_i) \in \sigma(L(T(v_{i+1})))$ we obtain $\sigma(x_i) \notin \sigma(L(T(v_{i+1})))$ and $\sigma(y_i) \notin \sigma(L(T(v_{i-1})))$. Thus, there is no $\tilde{y}$ such that $\sigma(\tilde{y}) = \sigma(y'_i) = \sigma(y_i)$ and $\operatorname{lca}_T(x_i,\tilde{y}) \prec_T u = \operatorname{lca}_T(x_i,y'_i) = \operatorname{lca}_T(x_i,y_i)$, and no $\tilde{x}$ such that $\sigma(\tilde{x}) = \sigma(x'_i) = \sigma(x_i)$ and $\operatorname{lca}_T(y_i,\tilde{x}) \prec_T u = \operatorname{lca}_T(y_i,x'_i) = \operatorname{lca}_T(y_i,x_i)$.
Hence, $\vec{G}$ contains the arcs $(x_i,y'_i)$, $(x_i,y_i)$, $(y_i,x_i)$ and $(y_i,x'_i)$. In particular, $x_iy_i$ is an edge in $\vec{G}$. However, since $\sigma(x'_i) = \sigma(x_i)$ and $\operatorname{lca}_T(x'_i,y'_i) \preceq_T v_i \prec_T u = \operatorname{lca}_T(x_i,y'_i)$, we conclude $(y'_i,x_i) \notin E(\vec{G})$. Likewise, $\sigma(y'_i) = \sigma(y_i)$ and $\operatorname{lca}_T(x'_i,y'_i) \preceq_T v_i \prec_T u = \operatorname{lca}_T(x'_i,y_i)$ imply that $(x'_i,y_i) \notin E(\vec{G})$. In summary, $\vec{G}[x_i,x'_i,y_i,y'_i] = [x_iy_i \searrow\!\!\!\!\nearrow x'_iy'_i]$ is an hourglass for all $i \in \{1,\dots,k\}$, and we have $x_i \preceq_T v_{i-1}$ and $y'_j \preceq_T v_j$ for all $1 \leq i < j \leq k$.
Since $j \geq (i-1)+2$ and $P(v_0,v_{k+1})$ is a shortest path, Lemma 4.2 implies that $\sigma(L(T(v_{i-1}))) \cap \sigma(L(T(v_j))) = \emptyset$. Thus, there is no $\tilde{y}$ such that $\sigma(\tilde{y}) = \sigma(y'_j)$ and $\operatorname{lca}_T(x_i,\tilde{y}) \prec_T u = \operatorname{lca}_T(x_i,y'_j)$, and no $\tilde{x}$ such that $\sigma(\tilde{x}) = \sigma(x_i)$ and $\operatorname{lca}_T(y'_j,\tilde{x}) \prec_T u = \operatorname{lca}_T(y'_j,x_i)$. This implies that $(x_i,y'_j) \in E(\vec{G})$ and $(y'_j,x_i) \in E(\vec{G})$, respectively. Therefore $x_iy'_j$ is an edge in $\vec{G}$ for $1 \leq i < j \leq k$. In summary, (H2) in Def. 3.21 is always satisfied.
Hence, if $x_1$ and $y_k$ are constructed based on Case (A) and (A'), respectively, we are done.
It remains to show that $z$ and $z'$ are a left and a right tail, resp., of the hourglass chain in Case (B) or (B').
First assume Case (B), and thus $z = x$. We have $z,x_1 \preceq_T v_0$ by construction and $(z,x_1) \in E(\vec{G})$ as shown above. Together with $x'_1 \preceq_T v_1$, this implies that $\operatorname{lca}_T(z,x_1) \preceq_T v_0 \prec_T u = \operatorname{lca}_T(z,x'_1)$. Using $\sigma(x_1) = \sigma(x'_1)$ we therefore obtain $(z,x'_1) \notin E(\vec{G})$, and hence $z$ is a left tail of the constructed hourglass chain. Now assume Case (B'), and thus $z' = y$. We have $z',y_k \preceq_T v_{k+1}$ and $(z',y_k) \in E(\vec{G})$ by construction. Together with $y'_k \preceq_T v_k$ this implies $\operatorname{lca}_T(z',y_k) \preceq_T v_{k+1} \prec_T u = \operatorname{lca}_T(z',y'_k)$. Using $\sigma(y_k) = \sigma(y'_k)$, we obtain $(z',y'_k) \notin E(\vec{G})$ and hence $z'$ is a right tail of the constructed hourglass chain.
In summary, $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ is an hourglass chain, possibly with left tail $z$ and right tail $z'$. Furthermore, precisely one of the Conditions 1–4 in the statement holds by construction.
Lemma 4.3 establishes a close connection between color-set intersection graphs and hourglass chains, which we will use below to simplify the identification of the corresponding u-fp edges. To this end, we first consider properties in relation to a tree $T$ explaining $(\vec{G},\sigma)$ that are common to the three types of u-fp edges we have encountered so far.
Definition 4.4.
An edge $xy$ in a vertex-colored graph $(\vec{G},\sigma)$ is a hug-edge if it satisfies at least one of the following conditions:
(C1) $xy$ is the middle edge of a good quartet in $(\vec{G},\sigma)$;
(C2) $xy$ is the first edge of an ugly quartet in $(\vec{G},\sigma)$; or
(C3) there is an hourglass chain $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$ in $(\vec{G},\sigma)$, and one of the following cases holds:
1. $x = x_1$ and $y_k = y$;
2. $y_k = y$ and $z := x$ is a left tail of $H$;
3. $x = x_1$ and $z' := y$ is a right tail of $H$; or
4. $z := x$ is a left tail and $z' := y$ is a right tail of $H$.
The term hug-edge refers to the fact that $xy$ is a particular edge of an hourglass chain, an ugly quartet, or a good quartet.
Theorem 4.5.
An edge $xy$ in $\vec{G}(T,\sigma)$ with $u := \operatorname{lca}_T(x,y)$, $v_x,v_y \in \operatorname{child}_T(u)$, $x \preceq_T v_x$, and $y \preceq_T v_y$ is a hug-edge if $v_x$ and $v_y$ belong to the same connected component of $\mathfrak{C}_T(u)$. Moreover, every hug-edge is u-fp.
Proof. We show first that $xy$ satisfies one of the Conditions (C1), (C2), or (C3), and hence is a hug-edge. First, note that $v_x \neq v_y$. Moreover, Lemma 2.4 implies $\sigma(x) \notin \sigma(L(T(v_y)))$ and $\sigma(y) \notin \sigma(L(T(v_x)))$. Since by assumption $v_x,v_y$ belong to the same connected component, there is a shortest path $P := (v_x = v_0,\dots,v_{k+1} = v_y)$ in $\mathfrak{C}_T(u)$. For $k = 0$, we have $v_xv_y \in E(\mathfrak{C}_T(u))$. This implies $S^{\cap}(x,y) = \sigma(L(T(v_x))) \cap \sigma(L(T(v_y))) \neq \emptyset$. By Prop. 3.15, the edge $xy$ is either the middle edge of a good quartet or the first edge of an ugly quartet in $(\vec{G},\sigma)$. Hence, Condition (C1) or (C2) is satisfied. If $k > 0$, Lemma 4.3 implies Condition (C3).
For each of the three cases we have already shown that $xy$ is u-fp: for (C1) Prop. 3.11 applies, for (C2) Prop. 3.14 provides the desired result, and for (C3) we use Lemma 3.25.
Lemma 4.6.
If the BMG $\vec{G}(T,\sigma)$ contains a hug-edge $xy$, then there are distinct vertices $v_1,v_2 \in \operatorname{child}_T(\operatorname{lca}_T(x,y))$ such that $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$.
Proof. Let $xy$ be a hug-edge in the BMG $(\vec{G},\sigma) = \vec{G}(T,\sigma)$, i.e., one of (C1), (C2), or (C3) applies.
If $e = xy$ satisfies (C1), then $xy$ is the middle edge of a good quartet $\langle zxyz' \rangle$ in $(\vec{G},\sigma)$. By [14, Lemma 36], there is a vertex $u := \operatorname{lca}_T(x,y,z,z')$ such that $x,z \preceq_T v_1$ and $y,z' \preceq_T v_2$ for some distinct $v_1,v_2 \in \operatorname{child}_T(u)$. Thus, $u = \operatorname{lca}_T(x,y)$. Moreover, since $\sigma(z) = \sigma(z')$, we have $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$ for two distinct vertices $v_1,v_2 \in \operatorname{child}_T(u)$.
If $e = xy$ satisfies (C2), then it is the first edge of some ugly quartet, which w.l.o.g. has the form $\langle xyx'z \rangle$. Re-using the arguments in the proof of Prop. 3.14 shows that there must be two distinct children $v_1$ and $v_2$ of vertex $u = \operatorname{lca}_T(x,y)$ such that $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$.
If $e = xy$ satisfies (C3), then there is a (tailed) hourglass chain $H = [x_1y_1 \searrow\!\!\!\!\nearrow x'_1y'_1],\dots,[x_ky_k \searrow\!\!\!\!\nearrow x'_ky'_k]$, $k \geq 1$, in $\vec{G}(T,\sigma)$, such that either $x = x_1$ or $z := x$ is a left tail of $H$, and either $y = y_k$ or $z' := y$ is a right tail of $H$. In either case, Lemma 3.24 implies $x \preceq_T v_0$ and $y \preceq_T v_{k+1}$. Since $x_1$ and $x'_1$ lie below distinct children $v_0$ and $v_1$ of vertex $\operatorname{lca}_T(x,y)$ and $\sigma(x_1) = \sigma(x'_1)$ by the definition of hourglasses, it holds that $\sigma(L(T(v_0))) \cap \sigma(L(T(v_1))) \neq \emptyset$.
In each case, therefore, there are distinct vertices $v_1,v_2 \in \operatorname{child}_T(\operatorname{lca}_T(x,y))$ such that $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) \neq \emptyset$.
The fact that all hug-edges are u-fp by Thm. 4.5 suggests considering the subgraph of a BMG that is left after removing all these unambiguously recognizable false-positive orthology assignments.
Definition 4.7.
Let $(\vec{G},\sigma)$ be a BMG with symmetric part $G$ and let $F$ be the set of its hug-edges. The no-hug graph $\mathbb{NH}(\vec{G},\sigma)$ is the subgraph of $G$ with vertex set $V(\vec{G})$, coloring $\sigma$ and edge set $E(G) \setminus F$.
The no-hug graph $\mathbb{NH}(\vec{G},\sigma)$ is therefore the subgraph of the underlying RBMG of $\vec{G}$ that contains all edges that cannot be identified as u-fp by using only good quartets, ugly quartets and (tailed) hourglass chains as outlined in Thm. 4.5.
Corollary 4.8.
Let $(T,\sigma)$ be a leaf-colored tree and $\mu$ a reconciliation map from $(T,\sigma)$ to some species tree $S$. Then, $\Theta(T,t_\mu) \subseteq \Theta(T,\hat{t}_T) \subseteq \mathbb{NH}(\vec{G}(T,\sigma)) \subseteq \vec{G}(T,\sigma)$.
Proof.
By Thm. 2.21, $\Theta(T,t_\mu) \subseteq \Theta(T,\hat{t}_T) \subseteq \vec{G}(T,\sigma)$; and by definition, we have $\mathbb{NH}(\vec{G}(T,\sigma)) \subseteq \vec{G}(T,\sigma)$. Now, let $xy$ be an edge in $\Theta(T,\hat{t}_T)$ and thus, $\hat{t}_T(\operatorname{lca}_T(x,y)) = \bullet$. By definition of $\hat{t}_T$, we have $\sigma(L(T(v_1))) \cap \sigma(L(T(v_2))) = \emptyset$ for any two distinct $v_1,v_2 \in \operatorname{child}_T(\operatorname{lca}_T(x,y))$. The contraposition of Lemma 4.6 implies that $xy$ is not a hug-edge and thus an edge of $\mathbb{NH}(\vec{G}(T,\sigma))$, which completes the proof.
The no-hug graph still may contain false-positive orthology assignments, i.e., $\mathbb{NH}(\vec{G}(T,\sigma)) = \Theta(T,\hat{t}_T)$ does not hold in general. As an example, consider the BMG $\vec{G}(T,\sigma)$ in Fig. 3. Here, none of the edges $xz$, $x'z$ and $yz$ are u-fp and thus, by Thm. 4.5, also not hug-edges. Hence, they still remain in $\mathbb{NH}(\vec{G}(T,\sigma))$. However, these edges are not contained in $\Theta(T,\hat{t}_T)$, since $\hat{t}_T(\operatorname{lca}_T(x,x',y,z)) = \square$ and thus, $\Theta(T,\hat{t}_T) \subsetneq \mathbb{NH}(\vec{G}(T,\sigma))$. In the following section we shall see that there are, however, no u-fp edges left in the no-hug graph.
Footnote: a good advice in the time of SARS-CoV-2
Figure 7:
A "true" scenario, that is, an event-labeled gene tree $(\tilde{T},\tilde{t},\sigma)$ embedded into a species tree (left). To obtain the least resolved tree $(T^*,\sigma)$ of $\vec{G}(\tilde{T},\sigma)$, the edge $v_1v_2$ has been contracted into vertex $v$. The BMG $\vec{G}(\tilde{T},\sigma)$ does not contain any u-fp edge. See text for further explanations.
Every BMG $(\vec{G},\sigma)$ contains all information necessary to determine the trees $(T,\sigma)$ by which it is explained. Since u-fp edges are defined in terms of the explaining trees, every BMG $(\vec{G},\sigma)$ also contains, at least implicitly, all information needed to identify its u-fp edges. Since $(\vec{G},\sigma)$ is determined by its unique least resolved tree $(T^*,\sigma)$, the u-fp edges must also be determined by $(T^*,\sigma)$. It is not sufficient for this purpose, however, to find an event labeling $t$ of the vertices of $T^*$.
To see this, consider for example the "true" history $(\tilde{T},\tilde{t},\sigma)$ of the BMG $\vec{G}(\tilde{T},\sigma)$ as shown in Fig. 7. The unique least resolved tree $(T^*,\sigma)$ for $\vec{G}(\tilde{T},\sigma)$ is obtained by merging the two vertices $v_1$ and $v_2$ of $\tilde{T}$, resulting in the vertex $v$ of $T^*$. We have $\tilde{t}(v_1) = \bullet \neq \square = \tilde{t}(v_2)$. For vertex $v$ and every reconciliation map $\mu$ from $(T^*,\sigma)$ to any species tree $S$ it must hold $\mu(v) \in E(S)$ and thus $t^*_\mu(v) = \square$, since $v$ has two children with overlapping color sets and by Lemma 2.17. Thus, the edges $cx$ with $x \in \{a_1,a_2,b_1,b_2\}$ are $(T^*,\sigma)$-fp although they are not false positives at all. Since speciation and duplication vertices may be merged into the same vertex $v$ of $T^*$, the least resolved tree $T^*$ in general cannot simply inherit the event labeling from the true gene history, and thus there may not be a "correct" labeling $t^*$ of $T^*$ that provides evidence for all u-fp edges.
The example in Fig. 7 shows that the least resolved tree $T^*$ simply may not be "resolved enough". In the following, we therefore describe how the unique least resolved tree can be resolved further to provide more evidence about u-fp edges. Eventually, this will lead us to a characterization of the u-fp edges.
To this end, we need to gain more insights into the structure of redundant edges, i.e., those edges $e$ in $T$ for which $(T_e, \sigma)$ still explains $\vec{G}(T, \sigma)$.

Since the color sets of distinct subtrees below a speciation vertex cannot overlap by Lemma 2.17, Cor. 2.11 implies that all edges below a speciation vertex are redundant and thus can be contracted. More precisely, we have

Observation 4.9.
Let $\mu$ be a reconciliation map from $(T, \sigma)$ to $S$ and assume that there is a vertex $u \in V(T)$ such that $\mu(u) \in V(S)$ and thus $t_\mu(u) = \bullet$. Then every inner edge $uv$ with $v \in \mathrm{child}_T(u)$ is redundant w.r.t. $\vec{G}(T, \sigma)$. Moreover, if an inner edge $uv$ with $v \in \mathrm{child}_T(u)$ is non-redundant, then $u$ must have two children with overlapping color sets, and hence $t_\mu(u) = \square$.

Our goal is to identify those vertices in $(T^*, \sigma)$ that can be expanded to yield a tree that still explains $\vec{G}(T^*, \sigma)$. To this end, we need to introduce a particular way of “augmenting” a leaf-colored tree.

Definition 4.10.
Let $(T, \sigma)$ be a leaf-colored tree, $u$ an inner vertex of $T$, $C_T(u)$ the corresponding color-set intersection graph, and $\mathcal{C}$ the set of connected components of $C_T(u)$. Then the tree $T_u$ augmented at vertex $u$ is obtained by applying the following editing steps to $T$:
• If $C_T(u)$ is connected, do nothing.
• Otherwise, for each $C \in \mathcal{C}$ with $|C| > 1$
  – introduce a vertex $w$ and attach it as a child of $u$, i.e., add the edge $uw$,
  – for every element $v_i \in C$, substitute the edge $uv_i$ by the edge $wv_i$.
The augmentation step is trivial if $T_u = T$, in which case we say that no edit step was performed.

An example of an augmentation is shown in Fig. 8. It is easy to see that the tree $T_u$ obtained by an augmentation of a phylogenetic tree $T$ is again a phylogenetic tree. The augmentation step at vertex $u$ of $T$ is trivial if and only if either $C_T(u)$ is connected or all connected components $C \in \mathcal{C}$ are singletons, i.e., $|C| = 1$. If $(T_u, \sigma)$ is obtained by augmenting $(T, \sigma)$ at node $u$, we denote the set of newly introduced vertices by $V_{\neg T} := V(T_u) \setminus V(T)$. Note that $V_{\neg T} = \emptyset$ whenever no edit step was performed.

Since augmentation only inserts vertices between $u$ and its children, it affects neither $L(T(u))$ nor $L(T(v))$ for $v \in \mathrm{child}(u)$. As an immediate consequence we find

Figure 8:
Left, a (part of a) leaf-colored tree ( T , σ ) . The tree ( T u , σ ) on the right is obtained from ( T , σ ) by aug-menting T at vertex u . The color-set intersection graph C T ( u ) (shown in the middle) has more than one connectedcomponent and there are connected components consisting of more than two vertices v i ∈ child T ( u ) . Accordingto Lemma 4.12, σ ( L ( T u ( v ))) ∩ σ ( L ( T u ( v (cid:48) ))) = /0 for any two distinct vertices v , v (cid:48) ∈ child T u ( u ) = { v , w , w } . ByCor. 2.11, the edges uw and uw are redundant w.r.t. (cid:126) G ( T u , σ ) and thus, both trees explain the same BMG. Observation 4.11.
Let $(T, \sigma)$ be a leaf-colored tree, $u \neq v$ two inner vertices of $T$, $C_T(u)$ the corresponding color-set intersection graph, and $(T_u, \sigma)$ the tree obtained by augmenting $T$ at $u$. Then $C_{T_u}(v) = C_T(v)$.

Lemma 4.12.
Let $(T, \sigma)$ be a leaf-colored tree. Let $u \in V(T)$ and $T_u$ be the tree after augmenting $T$ at vertex $u$. If $C_T(u)$ is disconnected, then $\sigma(L(T_u(w_1))) \cap \sigma(L(T_u(w_2))) = \emptyset$ for any two distinct vertices $w_1, w_2 \in \mathrm{child}_{T_u}(u)$.
Proof. By construction, the vertex $w_i$ in $T_u$, $i = 1, 2$, is either a child of $u$ in $T$ or was inserted in the augmentation step. Therefore, the two connected components $C_1$ and $C_2$ of $C_T(u)$ to which $w_1$ and $w_2$ belong are disjoint. Thus $\sigma(L(T(v_i))) \cap \sigma(L(T(v_j))) = \emptyset$ for all $v_i, v_j \in \mathrm{child}_T(u)$ with $v_i \in C_1$ and $v_j \in C_2$, because otherwise there would be an edge $v_iv_j$ in $C_T(u)$ and thus $C_1 = C_2$. Since $w_i$ is either the single vertex in $C_i$ or $w_i$ has as children the vertices of $C_i$ in $T_u$, $i \in \{1, 2\}$, we conclude that $\sigma(L(T_u(w_1))) \cap \sigma(L(T_u(w_2))) = \emptyset$.

The following result shows that no further edit step can be performed at vertices that have been newly introduced by a former augmentation step or have already undergone an augmentation.

Lemma 4.13.
Let $(T, \sigma)$ be a leaf-colored tree, $u \in V(T)$, $(T_u, \sigma)$ the tree obtained by augmenting $T$ at $u$, and denote by $(T_{uw}, \sigma)$ the tree obtained by augmenting $T_u$ at $w$. Then $T_{uw} = T_u$ for all $w \in V_{\neg T} \cup \{u\}$.
Proof. If $T_u = T$, then $V_{\neg T} = \emptyset$ and thus $T_{uu} = T_u = T$. If $T_u \neq T$, then the definition of the augmentation step at $u$ implies that either $C_{T_u}(u)$ is connected or all connected components of $C_{T_u}(u)$ are singletons. In either case, Lemma 4.12 ensures that augmentation at $u$ leaves $T_u$ unchanged, i.e., $T_{uu} = T_u$. By construction, $C_{T_u}(w)$ is connected for $w \in V_{\neg T} \setminus \{u\}$ and thus we have $T_{uw} = T_u$.

We now show that the application of all possible non-trivial augmentation steps in some tree $(T, \sigma)$ finally leads to a unique tree $(A(T), \sigma)$. It can be computed according to Algorithm 1.

Lemma 4.14.
For every leaf-colored tree $(T, \sigma)$ there is a unique tree $(A(T), \sigma)$ obtained from $(T, \sigma)$ by repeated application of augmentation steps until only trivial augmentation steps remain. The tree $(A(T), \sigma)$ is computed by Algorithm 1.
Proof. Lemma 4.13 together with Observation 4.11 implies that (i) every vertex $u$ in $T$ can be non-trivially augmented at most once, (ii) the newly introduced vertices cannot be non-trivially augmented at all, and (iii) augmentation of two distinct inner vertices of $T$ yields the same result irrespective of the order of the augmentation steps. Thus, $(A(T), \sigma)$ is unique. The correctness of Algorithm 1 now follows immediately.

Lemma 4.15.
Alg. 1 with input $T = (V, E)$ and $\sigma$ runs in $O(|V|^2 |S|)$ time and $O(|V|^2)$ space, where $S = \sigma(L(T))$ is the set of species under consideration.
Proof. Assigning the color set $\sigma(L(T(u)))$ to each $u$ requires $O(|V| |S|)$ time, where $|S| < |V|$. The total effort to construct all $C_T(u)$ is bounded by $O(|V|^2 |S|)$, corresponding to comparing the color sets of all pairs of vertices of $T$. The total size of all color-set intersection graphs is in $O(|V|^2)$. Computation of the connected components is linear in the size of the graph, which also bounds the editing effort for each $u$, implying the claim.

We close this section by showing that augmentation does not affect the underlying BMG and thus, the unique tree obtained by Alg. 1 still explains the same BMG.

Algorithm 1: Augmented tree
Data: Leaf-colored phylogenetic tree $(T, \sigma)$
Result: Augmented tree $(A(T), \sigma)$
foreach $u \in V(T)$ in pre-order do
    Compute $C_T(u)$.
    $\mathcal{C} \leftarrow$ set of connected components of $C_T(u)$
    if $|\mathcal{C}| > 1$ then
        foreach $C \in \mathcal{C}$ such that $|C| > 1$ do
            Introduce a vertex $w$ and the edge $uw$.
            foreach $v_i \in C$ do
                Remove the edge $uv_i$.
                Add the edge $wv_i$.
            end
        end
    end
end

Proposition 4.16.
For every leaf-colored tree $(T, \sigma)$, it holds that $\vec{G}(T, \sigma) = \vec{G}(A(T), \sigma)$.
Proof. Let $u \in V(T)$ and $T_u$ be the tree after augmenting $T$ at vertex $u$. Put $A := \{uw \mid w \in V_{\neg T}\}$ and note that all edges of $T_u$ in $A$ are inner edges. Now consider $e = uw \in A$. Since $w \in V_{\neg T}$, an edit step was performed to obtain $w$ and thus $|\mathcal{C}| > 1$ for the set $\mathcal{C}$ of connected components of $C_T(u)$. Lemma 4.12 and $|\mathcal{C}| > 1$ imply that for every $v' \in \mathrm{child}_{T_u}(u)$ with $v' \neq w$ we have $\sigma(L(T_u(v'))) \cap \sigma(L(T_u(w))) = \emptyset$. Thus, Cor. 2.11 implies that the edge $uw$ is redundant in $(T_u, \sigma)$ w.r.t. $\vec{G}(T, \sigma)$.

Denoting by $(T_u)_A$ the tree obtained from $T_u$ by contraction of all edges in $A$, we obtain $(T, \sigma) = ((T_u)_A, \sigma)$. Lemma 2.13 now implies $\vec{G}(T_u, \sigma) = \vec{G}((T_u)_A, \sigma) = \vec{G}(T, \sigma)$ for every augmentation step. By Lemma 4.14, we can repeat this argument for every augmentation in the arbitrary order in which $\vec{G}(A(T), \sigma)$ is obtained from $\vec{G}(T, \sigma)$, and thus $\vec{G}(A(T), \sigma) = \vec{G}(T, \sigma)$.

While the least resolved tree in general cannot support an event labeling that properly reflects the underlying true history of a gene family, we shall see here that the augmented tree $(A(T), \sigma)$ does feature sufficient resolution. To this end, we investigate the extremal event labeling of $(A(T), \sigma)$.

Lemma 4.17.
Let $\hat{t}_{A(T)}$ be the extremal event labeling of the augmented tree $(A(T), \sigma)$ obtained from $(T, \sigma)$ and let $u$ be some vertex of $A(T)$. If $\hat{t}_{A(T)}(u) = \square$, then $C_{A(T)}(u)$ is connected.
Proof. Suppose that $\hat{t}_{A(T)}(u) = \square$. There are two possibilities:
(1) $u \in V(T)$. If $C_T(u)$ is connected, then $C_{A(T)}(u) = C_T(u)$. Otherwise, Lemma 4.12 implies that $\sigma(L(A(T)(w_1))) \cap \sigma(L(A(T)(w_2))) = \emptyset$ for all distinct $w_1, w_2 \in \mathrm{child}_{A(T)}(u)$; thus the definition of the extremal event labeling implies $\hat{t}_{A(T)}(u) \neq \square$, a contradiction.
(2) $u \in V_{\neg T}$, i.e., $u$ is newly created by augmenting some $u' \in V(T)$; hence $C_{T_{u'}}(u)$ is connected and, by Obs. 4.11 and Lemma 4.13, $C_{A(T)}(u)$ is connected.

Lemma 4.18.
Let $(\vec{G}, \sigma)$ be a BMG and $(T^*, \sigma)$ its unique least resolved tree. Moreover, let $\hat{t} := \hat{t}_{A(T^*)}$ be the extremal event labeling of the augmented tree $(A(T^*), \sigma)$. Then $\Theta(A(T^*), \hat{t}) \subseteq \vec{G}$.
Proof. Since $(T^*, \sigma)$ explains $(\vec{G}, \sigma)$, we have $(\vec{G}, \sigma) = \vec{G}(T^*, \sigma)$. By Prop. 4.16, we have $\vec{G}(T^*, \sigma) = \vec{G}(A(T^*), \sigma)$. Let $xy$ be an edge in $\Theta(A(T^*), \hat{t})$. By definition, $\hat{t}(u) = \bullet$ where $u := \mathrm{lca}_{A(T^*)}(x, y)$. By definition of the extremal event labeling, $\sigma(L(A(T^*)(v_1))) \cap \sigma(L(A(T^*)(v_2))) = \emptyset$ for any two distinct vertices $v_1, v_2 \in \mathrm{child}_{A(T^*)}(u)$. The latter is true, in particular, for the two children $v_x, v_y \in \mathrm{child}_{A(T^*)}(u)$ with $x \preceq_{A(T^*)} v_x$ and $y \preceq_{A(T^*)} v_y$. Therefore, $\sigma(x) \notin \sigma(L(A(T^*)(v_y)))$ and $\sigma(y) \notin \sigma(L(A(T^*)(v_x)))$. We conclude that $x$ and $y$ are reciprocal best matches in $A(T^*)$. Finally, $(\vec{G}, \sigma) = \vec{G}(A(T^*), \sigma)$ implies that $xy$ is an edge in $\vec{G}$.

Now we are in the position to prove the main result of this contribution.

Theorem 4.19.
Let $(\vec{G}, \sigma)$ be a BMG, $(T^*, \sigma)$ its unique least resolved tree, and $\hat{t} := \hat{t}_{A(T^*)}$ the extremal event labeling of the augmented tree $(A(T^*), \sigma)$. Then $(\Theta(A(T^*), \hat{t}), \sigma) = \mathrm{NH}(\vec{G}, \sigma)$.
Proof. Let $(G, \sigma)$ be the symmetric part of $(\vec{G} = (V, E), \sigma)$. For simplicity, we write $G_\Theta := \Theta(A(T^*), \hat{t})$ and $G_{\mathrm{NH}} := (V, E(\mathrm{NH}(\vec{G}, \sigma)))$. Recall that, by definition, $G_{\mathrm{NH}} \subseteq G$ and, by Lemma 4.18, $G_\Theta \subseteq \vec{G}$. Finally, as $G$ contains only edges of $\vec{G}$, we have $G_\Theta \subseteq G$. Let $F := E(G) \setminus E(G_{\mathrm{NH}})$ be the set of all edges of $G$ that are hug-edges, and let $F' := E(G) \setminus E(G_\Theta)$ be the set of all edges in $G$ that do not form orthologous pairs. Since $G_{\mathrm{NH}}, G_\Theta \subseteq G$, it suffices to verify that $F = F'$ in order to show that $(G_\Theta, \sigma) = (G_{\mathrm{NH}}, \sigma)$.

Assume $e = xy \in F'$. Hence, $xy \notin E(G_\Theta)$ and therefore $\hat{t}(u) = \square$ where $u := \mathrm{lca}_{A(T^*)}(x, y)$. By Lemma 4.17, $C_{A(T^*)}(u)$ has exactly one connected component. This together with Thm. 4.5 implies that $xy$ is a hug-edge and thus $xy \in F$, and hence $F' \subseteq F$.

Assume $e = xy \in F$ is a hug-edge. Assume, for contradiction, that $e \notin F'$ and thus $\hat{t}(u) = \bullet$ where $u := \mathrm{lca}_{A(T^*)}(x, y)$. By definition of the extremal event labeling, it must therefore hold that $\sigma(L(A(T^*)(v_1))) \cap \sigma(L(A(T^*)(v_2))) = \emptyset$ for any two distinct vertices $v_1, v_2 \in \mathrm{child}_{A(T^*)}(u)$. By Prop. 4.16, $(A(T^*), \sigma)$ explains $(\vec{G}, \sigma)$. This together with Lemma 4.6 implies that there are two distinct vertices $v_1, v_2 \in \mathrm{child}_{A(T^*)}(u)$ such that $\sigma(L(A(T^*)(v_1))) \cap \sigma(L(A(T^*)(v_2))) \neq \emptyset$; a contradiction. Therefore, $e \in F'$, and hence $F \subseteq F'$.

Corollary 4.20.
An edge $xy$ in a BMG $(\vec{G}, \sigma)$ is u-fp if and only if $xy$ is a hug-edge of $(\vec{G}, \sigma)$.
Proof. Let $(\vec{G}, \sigma)$ be a BMG, $(T^*, \sigma)$ its unique least resolved tree, and $\hat{t} := \hat{t}_{A(T^*)}$ the extremal event labeling of the augmented tree $(A(T^*), \sigma)$. As shown in the proof of Thm. 4.19, every edge $xy$ of the symmetric part $G$ that is not a hug-edge satisfies $xy \in E(G_\Theta)$ and therefore $\hat{t}(u) = \bullet$, where $u := \mathrm{lca}_{A(T^*)}(x, y)$. Lemma 3.2 implies that $xy$ is not $(A(T^*), \sigma)$-fp and thus, in particular, not u-fp. That is, all edges in $(G_\Theta, \sigma) = (G_{\mathrm{NH}}, \sigma)$ are non-u-fp edges. Moreover, Thm. 4.5 implies that all hug-edges in $E(G) \setminus E(G_{\mathrm{NH}})$ are u-fp. Since $(G_{\mathrm{NH}}, \sigma)$ does not contain u-fp edges, all u-fp edges must also be hug-edges, which completes the proof.

The results imply that the no-hug graph $\mathrm{NH}(\vec{G}, \sigma)$ is obtained from $(\vec{G}, \sigma)$ by removing all u-fp edges. This and the fact that $\mathrm{NH}(\vec{G}, \sigma) = (\Theta(A(T^*), \hat{t}), \sigma)$ is an orthology graph imply that $\mathrm{NH}(\vec{G}, \sigma)$ is the best estimate of the orthology relation that we can make for a given BMG $(\vec{G}, \sigma)$. By Thm. 2.20, $\mathrm{NH}(\vec{G}, \sigma)$ must also be a cograph.

We next show that the computation of $\mathrm{NH}(\vec{G}, \sigma)$ can be achieved in polynomial time. In fact, the effort is dominated by computing the least resolved tree $(T^*, \sigma)$ for a given BMG.

Theorem 4.21.
For a given BMG ( (cid:126) G , σ ) , the set of all u-fp edges can be computed in O ( | L | | S | ) time, where L = V ( (cid:126) G ) and S = σ ( L ( T )) is the set of species under consideration.Proof. Given a BMG ( (cid:126) G , σ ) , its least resolved tree ( T ∗ , σ ) can be computed in O ( | L | | S | ) time (cf. Thm. 2.6 and [12,Sect. 5]). The augmented tree ( A ( T ∗ ) , σ ) can be obtained from ( T ∗ , σ ) in O ( | L | | S | ) time according to Lemma 4.15.The extremal event labeling (cid:98) t can be obtained from the connectivity information on the C A ( T ∗ ) ( u ) in linear time.Computing ( Θ ( A ( T ∗ ) , (cid:98) t ) , σ ) = NH ( (cid:126) G , σ ) then only requires evaluation of lca A ( T ∗ ) ( x , y ) , which can be achieved inpolynomial time in O ( | L | ) as described in [12, Sect. 5]).As argued in [12, Sect. 5], in practical applications the number of genes between different species will be com-parable, that is, O ( (cid:96) ) = O ( | L | / | S | ) with (cid:96) = max s ∈ S | L [ s ] | . In this case, the running time to compute ( T ∗ , σ ) reducesto O ( | L | / | S | ) and we obtain an overall running time to compute the set of all u-fp edges of O ( | L | / | S | + | L | | S | ) .Thm. 4.19 and 4.21 imply that we do not need to find induced quartets and hourglasses explicitly, nor do we need toidentify the hourglass chains. Instead, it is more efficient to compute the least resolved tree ( T ∗ , σ ) , its augmentations A ( T ∗ , σ ) , and the corresponding extremal event labeling (cid:98) t . The characterization of u-fp edges is in a way surprising when compared to previous results on the structure ofRBMGs [13, 14], where quartets were a focal point of the investigation. On one hand, Prop. 3.15 provides theexpected connection of u-fp edges with good and ugly quartets, while on the other hand, the u-fp edges in hourglasses,cf. Prop. 3.20, show that u-fp edges can be entirely unrelated to quartets and thus induced P s. 
In this section, we aim to close this gap in our understanding. We start with a special case for which quartets are sufficient.
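Before turning to quartets, the tree-based procedure of Section 4 (augmentation as in Algorithm 1, followed by the extremal event labeling, cf. Thm. 4.19) can be summarized in a minimal Python sketch. The dictionary-based tree representation, all function names, and the toy input are assumptions made for illustration; the input tree is simply assumed to play the role of the least resolved tree, and the extremal event labeling is implemented as we read it from the surrounding text (duplication if and only if two children have overlapping color sets).

```python
from itertools import combinations

def subtree_colors(children, sigma, root):
    """Map every vertex to the frozenset of leaf colors below it."""
    colors = {}
    def fill(v):
        kids = children.get(v, [])
        if kids:
            colors[v] = frozenset().union(*(fill(c) for c in kids))
        else:                                   # v is a leaf
            colors[v] = frozenset([sigma[v]])
        return colors[v]
    fill(root)
    return colors

def overlap_components(kids, colors):
    """Connected components of the color-set intersection graph on `kids`."""
    unseen, components = set(kids), []
    for start in kids:
        if start not in unseen:
            continue
        unseen.discard(start)
        stack, comp = [start], []
        while stack:
            a = stack.pop()
            comp.append(a)
            for b in kids:
                if b in unseen and colors[a] & colors[b]:
                    unseen.discard(b)
                    stack.append(b)
        components.append(comp)
    return components

def augment(children, sigma, root):
    """Return (tree, colors) for the augmented tree A(T)."""
    tree = {u: list(kids) for u, kids in children.items() if kids}
    colors = subtree_colors(tree, sigma, root)
    for u in list(tree):                        # only vertices of the input tree
        comps = overlap_components(tree[u], colors)
        if len(comps) < 2:                      # C_T(u) connected: trivial step
            continue
        new_kids = []
        for i, comp in enumerate(comps):
            if len(comp) == 1:                  # singleton stays a child of u
                new_kids.append(comp[0])
            else:                               # insert w between u and comp
                w = ("aug", u, i)
                tree[w] = comp
                colors[w] = frozenset().union(*(colors[v] for v in comp))
                new_kids.append(w)
        tree[u] = new_kids
    return tree, colors

def extremal_labeling(tree, colors):
    """'duplication' iff two children have overlapping color sets."""
    return {u: "duplication" if any(colors[a] & colors[b]
                                    for a, b in combinations(kids, 2))
               else "speciation"
            for u, kids in tree.items()}

def leaves_below(tree, v):
    kids = tree.get(v, [])
    return [v] if not kids else [x for c in kids for x in leaves_below(tree, c)]

def orthology_edges(tree, label):
    """Unordered leaf pairs whose last common ancestor is a speciation."""
    edges = set()
    for u, kids in tree.items():
        if label[u] == "speciation":
            for a, b in combinations(kids, 2):
                edges |= {frozenset((x, y)) for x in leaves_below(tree, a)
                          for y in leaves_below(tree, b)}
    return edges

# Hypothetical toy input: the root v has the children c, w1 and w2.
sigma = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c": "C"}
lrt = {"v": ["c", "w1", "w2"], "w1": ["a1", "b1"], "w2": ["a2", "b2"]}
aug_tree, colors = augment(lrt, sigma, "v")
labels = extremal_labeling(aug_tree, colors)
edges = orthology_edges(aug_tree, labels)
```

In this toy example, `w1` and `w2` are grouped below a new vertex, the root becomes a speciation, and the new vertex a duplication. The recursion in `subtree_colors` and `leaves_below` is kept for brevity and is not tuned to the running time bounds discussed in Section 4.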
Definition 5.1.
A BMG $(\vec{G}, \sigma)$ is hourglass-free if it does not contain an hourglass as an induced subgraph.

Figure 9:
Two examples of trees whose BMGs $\vec{G}(T, \sigma)$ contain a hexagon $\langle x_1x_2x_3x_4x_5x_6 \rangle$. There are exactly two distinct possibilities for the placement of the non-symmetric arcs in the subgraph of the BMG induced by the hexagon, see proof of Lemma 5.2.

In particular, an hourglass-free BMG does not contain an hourglass chain either. Geiß et al. [14] found that a certain type of colored 6-cycles is an important characteristic of RBMGs with a “complicated” structure that can only be explained by multifurcating trees. Let us write $\langle x_1x_2 \ldots x_k \rangle$ for an induced cycle $C_k$ with edges $x_ix_{i+1}$, $1 \leq i \leq k-1$, and $x_kx_1$ in the symmetric part $G$ of $\vec{G}$. We say that $(\vec{G}, \sigma)$ contains a hexagon if the corresponding RBMG $(G, \sigma)$ contains an induced $C_6 = \langle x_1x_2 \ldots x_6 \rangle$ such that any three consecutive vertices of $C_6$ have pairwise distinct colors, i.e., $\sigma(x_i) = \sigma(x_{i+3})$, $1 \leq i \leq 3$. A graph $(\vec{G}, \sigma)$ is hexagon-free if it does not contain a hexagon.

Lemma 5.2.
If a BMG ( (cid:126) G , σ ) is hourglass-free, then it is hexagon-free.Proof. By contraposition, suppose that ( (cid:126) G , σ ) contains a hexagon (cid:104) x x ... x (cid:105) . Thus, P = (cid:104) x x x x (cid:105) is an induced P in (cid:126) G with σ ( x ) = σ ( x ) . By [14, Lemma 32]), this P is either a good or a bad quartet. Hence, either the arcs ( x , x ) , ( x , x ) ∈ E ( (cid:126) G ) and ( x , x ) , ( x , x ) / ∈ E ( (cid:126) G ) , or ( x , x ) , ( x , x ) ∈ E ( (cid:126) G ) and ( x , x ) , ( x , x ) / ∈ E ( (cid:126) G ) , seeFig. 9.Assume that the arcs ( x , x ) and ( x , x ) exist, i.e., P is a good quartet. Now P (cid:48) = (cid:104) x x x x (cid:105) is an induced P in (cid:126) G as well and satisfies σ ( x ) = σ ( x ) . Since the arc ( x , x ) exist and by the arguments above, P (cid:48) can onlybe a bad quartet and thus, ( x , x ) ∈ E ( (cid:126) G ) and ( x , x ) / ∈ E ( (cid:126) G ) . Repeating the latter arguments while traversing theinduced C implies that ( x , x ) , ( x , x ) ∈ E ( (cid:126) G ) and ( x , x ) , ( x , x ) / ∈ E ( (cid:126) G ) . Hence, we obtain the hourglass [ x x (cid:38)(cid:37) x x ] . Similar arguments imply that there is an hourglass in ( (cid:126) G , σ ) if ( x , x ) , ( x , x ) ∈ E ( (cid:126) G ) and ( x , x ) , ( x , x ) / ∈ E ( (cid:126) G ) .Clearly, the converse of Lemma 5.2 is not always satisfied, since, by Obs. 3.18, an hourglass is a BMG withouthexagons.A very useful observation in previous work is the fact that every 3-colored vertex induced subgraph of an RBMG ( G , σ ) is again an RBMG [14, Thm. 7]. Furthermore, the connected components ( C , σ ) of every 3-colored vertexinduced subgraph of ( G , σ ) belong to precisely one of the three types [14, Thm. 5]: Type (A) ( C , σ ) contains a K on three colors but no induced P . 
Type (B) $(C, \sigma)$ contains an induced $P_4$ on three colors whose endpoints have the same color, but no induced cycle $C_n$ on $n \geq 5$ vertices.
Type (C) $(C, \sigma)$ contains a hexagon.

The graphs for which all such 3-colored connected components are of Type (A) are exactly the RBMGs that are already cographs, or co-RBMGs for short [14, Thm. 8 and Remark 2]. Together with Lemma 5.2, this classification immediately implies

Corollary 5.3. Let $(\vec{G}, \sigma)$ be an hourglass-free BMG. Then its symmetric part $(G, \sigma)$ is either a co-RBMG or it contains an induced $P_4$ on three colors whose endpoints have the same color, but no induced cycle $C_n$ on $n \geq 5$ vertices.

We already know from Prop. 3.15 and Cor. 4.20 that all u-fp edges in an hourglass-free BMG are identified by the good and ugly quartets, which are 3-colored by construction. In hourglass-free BMGs, it is indeed sufficient to consider only the 3-colored $P_4$s to identify all u-fp edges and thus to obtain an orthology graph, even though the BMG may also contain 4-colored $P_4$s. Since hourglasses can only appear in BMGs that require multifurcations for their explanation (cf. Lemma 3.19), the case of hourglass-free BMGs is the most relevant for practical applications.

Since all u-fp edges in an hourglass-free BMG are contained in quartets, it is also easy to identify the ones that are already orthology graphs.

Corollary 5.4.
Let $(\vec{G}, \sigma)$ be an hourglass-free BMG. Then its symmetric part $(G, \sigma)$ is a co-RBMG if and only if there are no u-fp edges in $(\vec{G}, \sigma)$.
Proof. Suppose first that $(G, \sigma)$ is a co-RBMG. Since $(G, \sigma)$ is a cograph, it contains no induced $P_4$s and thus $(\vec{G}, \sigma)$ contains no good or ugly quartets. By Cor. 4.20, all hug-edges are determined by hourglass chains and good or ugly quartets. Since none of them is contained in $(\vec{G}, \sigma)$, it also does not contain u-fp edges. Conversely, suppose that $(\vec{G}, \sigma)$ contains no u-fp edges. Then, by Thm. 4.19, $(G, \sigma) = \mathrm{NH}(\vec{G}, \sigma)$ is an orthology graph and thus, by Thm. 2.20, a cograph.

u-fp Edges in Hourglass Chains
The situation is much more complicated in the presence of hourglasses. We start by providing sufficient conditions for u-fp edges that are identified by hourglass chains.
Proposition 5.5.
Let H = [ x y (cid:38)(cid:37) x (cid:48) y (cid:48) ] ,..., [ x k y k (cid:38)(cid:37) x (cid:48) k y (cid:48) k ] be an hourglass chain in ( (cid:126) G , σ ) , possibly with a left tail zor a right tail z (cid:48) . Then, an edge in (cid:126) G is u-fp if it is contained in the setF = { x i y j | ≤ i ≤ j ≤ k } ∪ { zz (cid:48) } ∪ { zy i , x i z (cid:48) , zy (cid:48) i , x (cid:48) i z (cid:48) | ≤ i ≤ k }∪ { x i x j + | ≤ i < j < k } ∪ { y i y j + | ≤ i < j < k }∪ { x (cid:48) y (cid:48) i , x (cid:48) y i | ≤ i ≤ k } ∪ { x i y (cid:48) k , x (cid:48) i y (cid:48) k | ≤ i ≤ k − }∪ { x (cid:48) z , x (cid:48) z (cid:48) , y (cid:48) k z , y (cid:48) k z (cid:48) } Proof.
Let ( T , σ ) be an arbitrary tree that explains ( (cid:126) G , σ ) . By analogous arguments as in the proof of Lemma 3.25and by Lemma 3.24, there is a vertex u ∈ V ( T ) with pairwise distinct children v , v ,..., v k , v k + such that it holds x ∈ L ( T ( v )) , y k ∈ L ( T ( v k + )) and, for all 1 ≤ i ≤ k , we have x (cid:48) i , y (cid:48) i ∈ L ( T ( v i )) . Since x i + = y (cid:48) i and x (cid:48) i + = y i bydefinition of hourglass chains, it is an easy task to verify that for all edges e = ab ∈ F the vertices a and b are locatedbelow distinct children of u and thus, lca T ( a , b ) = u for all such edges. As argued in the proof of Lemma 3.25, wehave σ ( L ( T ( v ))) ∩ σ ( L ( T ( v ))) (cid:54) = /0. The latter arguments together with Lemma 3.2 imply that every edge in F is u-fp .Figs. 6 and 10 furthermore show that hourglass chains identify false-positive edges that are not associated withquartets in the BMG: The BMG in Fig. 6(A) has the u-fp edge xy , and the BMG in Fig. 10(B) contains the u-fp edges x y , x z (cid:48) and x (cid:48) z (cid:48) . A careful investigation shows that these edges are either not even part of an induced P (such as xy in Fig. 6 and x (cid:48) z (cid:48) in Fig. 10), or at least not identifiable as u-fp via good, bad or ugly quartets according to Props. 3.11,3.13 and 3.14, as it is the case for x y and x z (cid:48) in Fig. 10.This observation limits the use of cograph-editing in the context of orthology detection, at least in the case ofgene trees with polytomies: On one hand, Fig. 6 shows that an RBMG ( G , σ ) can be a cograph and still contain u-fp edges, on the other hand, 10(C) shows that deletion of the u-fp edge identified by quartets and thus, induced P s isnot sufficient to arrive at a cograph. P s Geiß et al. [14, Thm. 8] establishes that the RBMG ( G , σ ) is a co-RBMG, i.e., a cograph, if and only if every subgraphinduced on three colors is a cograph. 
Therefore, if $(G, \sigma)$ contains an induced 4-colored $P_4$, it also contains an induced 3-colored $P_4$. For hourglass-free BMGs $(\vec{G}, \sigma)$ it is clear that a 4-colored $P_4$ always overlaps with a 3-colored $P_4$: In this case $\mathrm{NH}(\vec{G}, \sigma)$ is obtained by deleting middle edges of good quartets and first edges of ugly quartets. Since $\mathrm{NH}(\vec{G}, \sigma)$ is a cograph, there is no $P_4$ left, and thus at least one edge of any 4-colored $P_4$ was among the deleted edges. It is natural to ask whether this is true for BMGs in general. Fig. 11 shows that good and ugly quartets are not sufficient on their own: there are 4-colored $P_4$s that do not overlap with the middle edge of a good quartet or the first edge of an ugly quartet. On the other hand, it is clear that at least one edge of such a $P_4$ is u-fp. This does not imply, however, that the u-fp edges in a 4-colored $P_4$ are also edges of 3-colored $P_4$s.

Still, in the context of cograph-editing approaches it is of interest whether the 3-colored $P_4$s are sufficient. In the following we provide an affirmative answer.

Lemma 5.6.
Let ( (cid:126) G , σ ) be a BMG and P a 4-colored induced P in the symmetric part of ( (cid:126) G , σ ) . Then at least oneof the edges of P is either the middle edge of some good quartet or the first edge of a bad or ugly quartet in ( (cid:126) G , σ ) .Proof. Let ( T , σ ) be an arbitrary tree that explains ( (cid:126) G , σ ) and suppose that P : = (cid:104) abcd (cid:105) is a 4-colored induced P inthe symmetric part ( G , σ ) .If one of the edges ab , bc , or cd of P is the middle edge of some good quartet or the first edge of some uglyquartet, then we are done. Hence, we assume in the following that this is not the case and show that at least one ofthe edges of P is the first edge in a bad quartet.By contraposition of Prop. 3.15, we have S ∩ ( a , b ) = /0, S ∩ ( b , c ) = /0 and S ∩ ( c , d ) = /0. We set v : = lca T ( b , c ) with children v b , v c ∈ child T ( v ) such that b (cid:22) T v b and c (cid:22) T v c , and w : = lca T ( a , b ) with children w a , w b ∈ child T ( w ) such that a (cid:22) T w a and b (cid:22) T w b . Note, that v , v b , w , and w b are pairwise comparable, since they are all ancestors of b . x' y' y y' y x x' z' x y y' x' y' y z' x x' y' y y' y z'y y' x' y' y y' y z' x y y' y y y' x' y y y' y' z' x y y' x' y' y z' x y' x' y' x y y' x' y' y z' A B C DEG F H (T, σ ) Figure 10:
The (non-binary) tree ( T , σ ) in Panel (A) explains the BMG ( (cid:126) G , σ ) in Panel (B), which contains severalinduced P s and an hourglass chain of length k = z (cid:48) . Edges that are not ( T , σ ) - fp (and thus not u-fp ) are shown as thick lines. Thin edges correspond to those that can be identified as u-fp by the subgraphs in(E–H), where they are highlighted in red. (C) The graph after deletion of all edges that can be identified by good,bad and ugly quartets according to Props. 3.11, 3.13, and 3.14. Note that it contains the induced P s (cid:104) y (cid:48) x (cid:48) z (cid:48) y (cid:105) and (cid:104) y (cid:48) x (cid:48) z (cid:48) x (cid:105) , which were not induced subgraphs of the orginal BMG in (B). Its symmetric part ( H , σ ) differsfrom NH ( (cid:126) G , σ ) (cf. Def. 4.7) since it still contains u-fp edges. (D) The BMG after deletion of all u-fp edges. Itssymmetric part, comprising the thick edges, is NH ( (cid:126) G , σ ) . (E) The two good quartets. (F) The single bad quartet.(G) Examples for ugly quartets that cover the remaining u-fp edges that are identifiable via quartets. Panel (H)shows the BMG ( (cid:126) G , σ ) in a different layout that highlights the hourglass chain with right tail z (cid:48) . All edges that are u-fp according to Prop. 5.5 are in red. To identify the u-fp edges in ( (cid:126) G , σ ) , only the subgraphs in Panel (E), (G)and (H) are necessary (cf. Def. 4.4 and Thm. 4.19).We show that w = v . Assume, for contradiction, that (i) w ≺ T v or (ii) v ≺ T w . In Case (i), we have w a ≺ T w (cid:22) T v b and thus, σ ( a ) ∈ σ ( L ( T ( v b ))) . Hence, as S ∩ ( b , c ) = /0, it must hold that σ ( a ) / ∈ σ ( L ( T ( v c ))) and σ ( c ) / ∈ σ ( L ( T ( v b ))) . Lemma 2.4 implies ac ∈ E ( G ) . But then P is not an induced P ; a contradiction. In Case (ii), wehave v c (cid:22) T v (cid:22) w b and thus, σ ( c ) ∈ σ ( L ( T ( w b ))) . 
Since S ∩ ( a , b ) = /0 we thus have σ ( c ) / ∈ σ ( L ( T ( w a ))) and σ ( a ) / ∈ σ ( L ( T ( w b ))) . By Lemma 2.4, ac ∈ E ( G ) ; again a contradiction. Thus w = v . Analogous arguments canbe used to establish lca T ( c , d ) = v . We therefore have v = lca T ( a , b ) = lca T ( b , c ) = lca T ( c , d ) . In the following v x denotes the child of v with x (cid:22) T v x for x ∈ { a , b , c , d } . Note, v a (cid:54) = v b , v b (cid:54) = v c and v c (cid:54) = v d .We next show that v a , v b , v c , and v d are pairwise distinct. Fist, assume for contradiction that v a = v c . Togetherwith S ∩ ( c , d ) = /0, this assumption implies that σ ( a ) / ∈ σ ( L ( T ( v d ))) and σ ( d ) / ∈ σ ( L ( T ( v c ))) . By Lemma 2.4, ad ∈ E ( G ) , contradicting the assumption that P is an induced P . Hence, v a (cid:54) = v c . By symmetry of P , we can usesimilar arguments to conclude that v b (cid:54) = v d . Finally, assume for contradiction that v a = v d . Then, σ ( d ) ∈ σ ( L ( T ( v a ))) .Hence, S ∩ ( a , b ) = /0 implies that σ ( d ) / ∈ σ ( L ( T ( v b ))) and σ ( b ) / ∈ σ ( L ( T ( v d ))) . Again Lemma 2.4 implies bd ∈ E ( G ) ; a contradiction. In summary, v a , v b , v c , and v d must be pairwise distinct.We claim σ ( c ) ∈ σ ( L ( T ( v a ))) . Since ad / ∈ E ( G ) and lca T ( a , d ) = v , Lemma 2.4 implies that σ ( a ) ∈ σ ( L ( T ( v d ))) or σ ( d ) ∈ σ ( L ( T ( v a ))) . By symmetry of P , we can w.l.o.g. assume that σ ( a ) ∈ σ ( L ( T ( v d ))) and thus, there is avertex a d ∈ L ( T ( v d )) with σ ( a d ) = σ ( a ) . In this case, S ∩ ( c , d ) = /0 implies that σ ( a ) / ∈ σ ( L ( T ( v c ))) . This togetherwith ac / ∈ E ( G ) and Lemma 2.4 implies that σ ( c ) ∈ σ ( L ( T ( v a ))) .We claim σ ( d ) ∈ σ ( L ( T ( v a ))) . 
We assume for contradiction that this is not the case and show that this impliesthe existence of an ugly quartet (cid:104) cdc (cid:48) a (cid:48) (cid:105) containing cd as its first edge, which leads to a contradiction to our initialassumption that none of the edges in P is the first, resp., middle edge of an ugly, resp., good quartet. To see this,note that σ ( a ) , σ ( c ) ∈ σ ( L ( T ( v a ))) and Lemma 2.3 imply that there is an edge a (cid:48) c (cid:48) for two vertices a (cid:48) , c (cid:48) ≺ T v a with σ ( a (cid:48) ) = σ ( a ) and σ ( c (cid:48) ) = σ ( c ) . Since σ ( a ) = σ ( a (cid:48) ) and lca T ( a (cid:48) , c (cid:48) ) (cid:22) T v a ≺ T v = lca T ( a (cid:48) , c ) , we have a (cid:48) c / ∈ E ( G ) .Since σ ( a d ) = σ ( a (cid:48) ) and lca T ( a d , d ) (cid:22) T v d ≺ T v = lca T ( a (cid:48) , d ) , we have a (cid:48) d / ∈ E ( G ) . Now, S ∩ ( c , d ) implies that σ ( c ) / ∈ σ ( L ( T ( v d ))) . This and σ ( d ) / ∈ σ ( L ( T ( v a ))) together with Lemma 2.4 implies that there is an edge c (cid:48) d ∈ E ( G ) .Thus, we obtain the ugly quartet (cid:104) cdc (cid:48) a (cid:48) (cid:105) and hence, the desired contradiction. Therefore, σ ( d ) ∈ σ ( L ( T ( v a ))) .Because of S ∩ ( a , b ) = /0 we also have σ ( d ) / ∈ σ ( L ( T ( v b ))) . c a a d b d d a ab d c a a d b d d a a db c Figure 11:
The symmetric part of the BMG ( (cid:126) G , σ ) contains the 4-colored induced P (cid:104) abcd (cid:105) . None of its edgesis the middle edge of a good quartet or the first edge of an ugly quartet. According to Lemma 5.6, there is the badquartet (cid:104) abca d (cid:105) that contains as first edge the edge ab .Since σ ( d ) ∈ σ ( L ( T ( v a ))) , there is a vertex d a (cid:22) v a with σ ( d a ) = σ ( d ) . Moreover, σ ( b ) / ∈ σ ( L ( T ( v a )) and σ ( d ) / ∈ σ ( L ( T ( v b ))) together with Lemma 2.4 implies that bd a ∈ E ( G ) . Furthermore, σ ( c ) ∈ σ ( L ( T ( v a ))) andLemma 2.4 imply that cd a / ∈ E ( G ) . Now, S ∩ ( c , d ) = /0 implies σ ( d ) / ∈ σ ( L ( T ( v c ))) and therefore, lca T ( c , d a ) = v (cid:22) lca T ( c , d (cid:48) ) for all d (cid:48) ∈ L [ σ ( d )] . Hence, ( c , d a ) ∈ E ( (cid:126) G ) .In summary, (cid:104) dcbd a (cid:105) is an induced P in G . By [14, Lemma 32], every such induced P forms either a good, bad,or ugly quartet in ( (cid:126) G , σ ) and, since ( c , d a ) ∈ E ( (cid:126) G ) , we can conclude that (cid:104) dcbd a (cid:105) is a bad quartet with first edge cd ,which completes the proof. Corollary 5.7. [14, Thm. 8] Let ( G , σ ) be an RBMG. Then, ( G , σ ) is a cograph if and only if all subgraphs inducedby three colors are cographs.Proof. If ( G , σ ) is a cograph, then all its induced subgraphs are also cographs [4]. Conversely, if ( G , σ ) is not acograph, then it contains at least one induced P . By Lemma 5.6, ( G , σ ) cannot contain only 4-colored P s andtherefore the restriction to at least one combination of three colors contains a P and is thus not a cograph.It is important to recall in this context, however, that the deletion of all u-fp -edges identified by quartets doesnot necessarily lead to a cograph as the example in Fig. 10(C) shows. The quartets alone therefore do not provide acomplete algorithm for correcting an RBMG to an orthology graph. 
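The criterion of Cor. 5.7 can be illustrated with a short brute-force sketch: instead of searching the whole graph for an induced $P_4$, only the subgraphs induced by three colors are inspected. The function names, the edge representation as a set of `frozenset` pairs, and the toy input are assumptions for illustration. The equivalence is stated in the text for RBMGs only, so the sketch presupposes that its input is an RBMG; it does not verify this, and the exhaustive search over 4-subsets is meant for small examples only.

```python
from itertools import combinations, permutations

def has_induced_p4(vertices, edges):
    """Brute-force test for an induced path on four vertices."""
    def adj(a, b):
        return frozenset((a, b)) in edges
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            if (adj(a, b) and adj(b, c) and adj(c, d)
                    and not adj(a, c) and not adj(a, d) and not adj(b, d)):
                return True
    return False

def is_cograph_by_color_triples(vertices, edges, sigma):
    """Check P4-freeness on every subgraph induced by three colors.

    Assumes the input is an RBMG, for which Cor. 5.7 states that this
    is equivalent to the whole graph being a cograph.
    """
    colors = sorted({sigma[v] for v in vertices})
    if len(colors) < 3:                 # fewer than three colors: test directly
        return not has_induced_p4(vertices, edges)
    for triple in combinations(colors, 3):
        sub = [v for v in vertices if sigma[v] in triple]
        if has_induced_p4(sub, edges):
            return False
    return True

# Hypothetical toy input: a 3-colored path a1 - b - c - a2.
sigma = {"a1": "A", "a2": "A", "b": "B", "c": "C"}
path_vertices = ["a1", "b", "c", "a2"]
path_edges = {frozenset(("a1", "b")), frozenset(("b", "c")),
              frozenset(("c", "a2"))}
path_is_cograph = is_cograph_by_color_triples(path_vertices, path_edges, sigma)

# Hypothetical toy input: a triangle on three colors.
tri_vertices = ["a1", "b", "c"]
tri_edges = {frozenset(("a1", "b")), frozenset(("b", "c")),
             frozenset(("a1", "c"))}
tri_is_cograph = is_cograph_by_color_triples(tri_vertices, tri_edges, sigma)
```

As the text stresses, passing or failing this test says nothing about which edges to delete: removing only the u-fp edges identified by quartets need not yield a cograph.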
We illustrate the potential impact of our mathematical results discussed in the previous sections with the help of simulated data. To this end, we focus on the accuracy of the inferred orthology graph assuming that the best matches are accurate. Of course, this is only one of several components in a complete orthology detection pipeline, which would also need to consider the genome annotation, pairwise alignments of genes or predicted protein sequences, and the conversion of sequence similarities into best match data. The latter step has been investigated in considerable detail by Stadler et al. [46]. Here, we start from simulated evolutionary scenarios and extract the BMG directly from the ground truth using the simulation library AsymmeTree [46].

In brief, AsymmeTree generates realistic evolutionary scenarios in four steps. (1) A planted species tree $S$ is generated using the Innovation Model [25], which models observed phylogenies well. (2) A dating map $\tau$ assigns time points to all vertices of $S$ and thus branch lengths to the edges of $S$. (3) On $S$, we use a variant of the well-known constant-rate birth-death process with a given age [see e.g. 17, 26] to simulate an event-labeled gene tree $(T, t, \sigma)$ containing duplication and loss events. Speciations are included as additional branching events that generate copies of all genes present at a speciation vertex in all descendant lineages. The simulated gene trees are constrained to have at least one surviving gene in each species to avoid trivial cases. (4) The observable part of the gene tree is extracted by recursively removing leaves that correspond to loss events and suppressing inner vertices with a single child. AsymmeTree can also assign rates to edges of $(T, t, \sigma)$ to convert evolutionary time differences into general additive distances; however, this is not relevant here since the rates do not affect evolutionary relatedness and thus the BMG.

Extending the simulations used in [13, 46], we also consider non-binary gene trees. This is important here since, by Lemma 3.19, hourglasses cannot appear in BMGs that are explained by a binary tree. There is an ongoing discussion to what extent polytomies in phylogenetic trees are biological reality as opposed to an artifact of insufficient resolution. While at the level of species trees the assumption that cladogenesis occurs by a series of bifurcations [e.g. 5, 33] seems to be prevailing, several authors have argued quite convincingly that there is evidence for at least some bona fide multifurcations of species [27, 41, 47]. In the simulation, polytomies in species trees are introduced after the first step by edge contraction with a user-defined probability $p$.

[Figure 12. Horizontal axis: level of non-binary duplications ($\lambda$); vertical axis: mean percentage of all fp edges; legend: only good, good and ugly, only ugly, remaining hug, remaining fp.]

Figure 12: Average relative abundance of the different types of hug-edges and undetectable false positives in the BMGs of simulated evolutionary scenarios. We distinguish hug-edges in good and ugly quartets as well as hug-edges appearing only in hourglass chains (orange). In the simulations, the fraction of u-fp edges that are first edges of bad quartets is too small to be visible and therefore not shown here. The undetectable false positives correspond to complementary gene losses without surviving witnesses of the duplication event. Species trees are binary, while gene trees contain multifurcations. The number of offspring is modeled as $2 + k$, where $k$ is drawn from a Poisson distribution with parameter $\lambda$. For $\lambda = 0$, the gene trees are binary. In the experiments, we observed that on average 62.4% of the 25000 simulated BMGs do not contain any false-positive edge (cf. Fig. 13). Those instances are included in the computation of the fraction $|F| / |E(G)|$ (percentage above the bars). However, for the computation of all other values only scenarios that contain false positives are considered.

The reality of polytomies is less clear for gene trees. One reason is the abundance of tandem duplications. Although the majority of tandem arrays comprises only a pair of genes, larger clusters are not at all rare [38]. Although one may argue that mechanistically they likely arise by stepwise duplications, such arrangements are often subject to gene conversion and non-homologous recombination that keeps the sequences nearly identical for some time before they eventually escape from concerted evolution and diverge functionally [18, 31]. As a consequence, duplications in tandem arrays may not be resolvable unless witnesses of different stages of an ongoing duplication process have survived. To model polytomies in the gene tree, we modify step (3) of the simulation procedure by replacing a simple duplication by the generation of $2 + k$ offspring genes. The number $k$ of additional copies is drawn from a Poisson distribution with parameter $\lambda >$ […] $S$ is set to unity. The duplication and loss rates in the gene trees are drawn i.i.d. from the uniform distribution on the interval [ . , . ). Multifurcating gene trees were produced for $\lambda$ = { . , . , . , . , . }.
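Two ingredients of the simulation procedure described above, the $2 + k$ offspring model and the extraction of the observable part of a gene tree in step (4), can be sketched as follows. This is a hedged illustration only and does not use or reproduce the AsymmeTree API: the function names, the nested-list tree encoding (inner vertex = list of subtrees, loss leaf = `None`, surviving gene = any other value), and the use of Knuth's multiplication method for Poisson sampling are assumptions made for this example.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def offspring_count(lam, rng):
    """Number of copies created by one duplication event: 2 + k, k ~ Poisson(lam)."""
    return 2 + poisson(lam, rng)


def observable_part(node):
    """Remove loss leaves and suppress inner vertices with a single child.

    Hypothetical encoding: None is a loss leaf, a list is an inner vertex,
    anything else is a surviving gene. Returns None if nothing survives.
    """
    if node is None:
        return None
    if not isinstance(node, list):
        return node
    kept = [c for c in (observable_part(child) for child in node) if c is not None]
    if not kept:
        return None          # the whole subtree was lost
    if len(kept) == 1:
        return kept[0]       # suppress the vertex with a single child
    return kept


rng = random.Random(42)
print(offspring_count(0.0, rng))   # lambda = 0 always gives a binary duplication
print(observable_part([["A", None], [None, None], "B"]))
```

For $\lambda = 0$ the sampler always returns 2, which matches the statement that gene trees are binary in that case.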
In total, we generated 5000 scenarios for each choice of $p$ and $\lambda$. Since the true scenarios, and thus the true gene tree $T$, the true BMG $\vec{G}$, and the corresponding RBMG $G$ are known, we can also determine the set

$$F := \{ xy \mid xy \in E(G) \text{ and } t(\mathrm{lca}_T(x, y)) = \square \} \qquad (2)$$

of false-positive edges. From the BMG, we compute the set $U$ of u-fp edges as well as the subsets $U_M$ and $U_U$ of u-fp edges that are middle edges of a good or first edges of an ugly quartet, respectively. Note that in general we have $U_M \cap U_U \neq \emptyset$. We only discuss the results for binary species trees in some detail, since species trees with polytomies yield qualitatively similar results. We observe that the relative abundance of u-fp edges in good and ugly quartets increases moderately for larger $p$.

[Figure 13. Three panels: $(G, \sigma)$, $(G_{\mathrm{good}}, \sigma)$, and $\mathrm{NH}(\vec{G}, \sigma)$; horizontal axis: no. of duplication events; vertical axis: no. of loss events; color scale: false discovery rate.]

Figure 13: False discovery rates computed as proportion of fp among all edges averaged over all scenarios with given number of duplications and losses. Left: RBMGs $(G, \sigma)$, i.e., $|F| / |E(G)|$. Middle: edited RBMG $(G_{\mathrm{good}}, \sigma)$ with all middle edges of good quartets removed, i.e., $|F \setminus U_M| / |E(G_{\mathrm{good}})|$. Right: no-hug graphs $\mathrm{NH}(\vec{G}, \sigma)$, i.e., $|F \setminus U| / |E(\mathrm{NH})|$. Scenarios with more than 80 duplication/loss events are not shown.

First, we note that, consistent with [13, 46], the fraction $|F| / |E(G)|$ of false-positive orthology assignments is small in our data set, on the order of 3%. This indicates that, in real-life data, the main source of errors is likely inaccuracy in determining best matches from sequence data rather than false-positive edges contained in the BMG. Considering the fraction $|U| / |F|$ of u-fp edges in Fig. 12, we find that even in the most adverse case of all gene trees being binary, the BMG identifies more than three quarters of $F$. It may be surprising at first glance that the problem becomes easier with increasing $\lambda$ and barely 6% of the false positives escape discovery. A likely explanation is that multifurcations increase the likelihood that an inner vertex has two surviving lineages that serve as witnesses of the event; in addition, multifurcations increase the vertex degree in the BMG, so that in principle more information is available to resolve the tree structure. It is also interesting to note that $U_U \setminus U_M$ is small, i.e., there are few cases of first edges in an ugly quartet that are not also middle edges in a good quartet.
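The set $F$ of Eq. (2) and the false discovery rates of Fig. 13 are straightforward to compute once the true event-labeled gene tree is known. The following Python sketch shows one way to do so; the parent-pointer encoding of the tree, the string labels `"duplication"` and `"speciation"`, and all function names are assumptions of this example rather than part of the original pipeline.

```python
def ancestors(v, parent):
    """Vertices on the path from v up to the root, starting with v itself."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(x, y, parent):
    """Last common ancestor of x and y in a tree given by parent pointers."""
    anc_x = set(ancestors(x, parent))
    for v in ancestors(y, parent):
        if v in anc_x:
            return v
    raise ValueError("x and y are not in the same tree")


def false_positive_edges(edges, parent, event):
    """F = edges xy of the RBMG whose last common ancestor is a duplication."""
    return {frozenset((x, y)) for x, y in edges
            if event[lca(x, y, parent)] == "duplication"}


def false_discovery_rate(edges, false_positives, removed=frozenset()):
    """|F \\ R| / |E \\ R| for a set R of removed edges (R empty gives |F| / |E|)."""
    remaining = {frozenset(e) for e in edges} - set(removed)
    if not remaining:
        return 0.0
    return len(set(false_positives) & remaining) / len(remaining)


# Toy scenario: a duplication d below a speciation root r.
parent = {"a1": "d", "b1": "d", "d": "r", "c1": "r"}
event = {"d": "duplication", "r": "speciation"}
rbmg_edges = [("a1", "b1"), ("a1", "c1"), ("b1", "c1")]

F = false_positive_edges(rbmg_edges, parent, event)
print(false_discovery_rate(rbmg_edges, F))             # one of three edges is fp
print(false_discovery_rate(rbmg_edges, F, removed=F))  # no fp edge remains
```

Passing $U_M$ or $U$ as `removed` corresponds to the middle and right panels of Fig. 13, respectively, under the assumption that the edited graph is obtained by deleting exactly those edges.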
The fraction of u-fp edges that appear only as first edges of bad quartets is even smaller; only 2-3% of the u-fp edges associated with hourglass chains, i.e., less than 0.15% of all u-fp edges, are of this type. The overwhelming majority of u-fp edges associated with quartets thus appear (also) as middle edges of good quartets. This observation provides an explanation for the excellent performance of removing the $U_M$-edges proposed in [13]. In particular in the case of binary trees, which was considered by Geiß et al. [13], there is only a small number of other u-fp edges, which are completely covered by $U_U$. Fig. 13 visualizes the appearance of false-positive edges depending on the number of duplication and loss events. Not surprisingly, $F$ is enriched in scenarios with a large number of losses compared to the duplications, and depleted when losses are rare. In fact, in the absence of losses, the RBMG equals the orthology graph, i.e., $F = \emptyset$ [13, Thm. 4]. Removal of $U_M$ already reduces the false positives considerably.

We have shown here how all unambiguously false-positive orthology assignments can be identified in polynomial time provided that all best matches are known. In particular, we have provided several characterizations for u-fp edges in terms of underlying subgraphs and refinements of trees. Since incorrect orthology assignments in the best match graph are always false positives, we have obtained a characterization of all unambiguously incorrect orthology assignments. Simulations showed that the majority of false positives comprises middle edges of good quartets, while u-fp edges that appear only as first edges of an ugly quartet are rare.
Not surprisingly, the hourglass-related u-fp edges become important in gene trees with many multifurcations.

The augmented tree $(A(T^*), \sigma)$ is the least resolved tree that admits an event labeling such that all inner vertices with child trees that have overlapping colors are designated as duplications while all inner vertices with color-disjoint child trees are designated as speciations. The tree $(A(T^*), \sigma)$ therefore does not contain “non-apparent duplications” in the sense of [28], i.e., duplication vertices with species-disjoint subtrees. This is an interesting connection linking the literature concerned with polytomy refinement in given gene trees [3, 28] with Best Match Graphs.

The extremal event labeling $\hat{t}$ of $(A(T^*), \sigma)$ is the one that minimizes the necessary number of duplications on $(A(T^*), \sigma)$. In a conceptual sense, therefore, $(A(T^*), \hat{t})$ is a “most parsimonious” solution, matching the idea of most parsimonious reconciliations [16, 37]. From a technical point of view, however, the problem we solve here is very different. Instead of considering a given pair of gene tree $T$ and species tree $S$, we ask here about the information contained in the BMG $(\vec{G}, \sigma)$, i.e., we only consider the information on the species tree that is already implicitly contained in $(\vec{G}, \sigma)$. The construction of the event-labeled gene tree $(A(T^*), \hat{t})$ in fact implies a set $\mathcal{S}$ of informative triples, namely those $\sigma(x)\sigma(y)|\sigma(z)$ with $\sigma(x), \sigma(y), \sigma(z)$ pairwise distinct and $\hat{t}(\mathrm{lca}_{A(T^*)}(x, y, z)) = \bullet$, that are displayed by the species tree $S$ [19, 24].
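The labeling rule and the resulting informative triples can be illustrated on an arbitrary leaf-colored tree. The sketch below is an assumption-laden toy version: it applies the rule "overlapping child colors = duplication, color-disjoint children = speciation" to whatever tree is passed in (it does not construct the augmented tree $A(T^*)$), it reads a triple $\sigma(x)\sigma(y)|\sigma(z)$ in the usual way, i.e., with $\mathrm{lca}(x, y)$ strictly below $\mathrm{lca}(x, y, z)$, and the `children` and `color` dictionaries as well as all function names are hypothetical.

```python
from itertools import combinations


def leaf_colors(v, children, color, cache):
    """Set of species (colors) found at the leaves below vertex v."""
    if v not in cache:
        if v not in children:                      # v is a leaf
            cache[v] = {color[v]}
        else:
            cache[v] = set().union(
                *(leaf_colors(c, children, color, cache) for c in children[v]))
    return cache[v]


def extremal_labeling(children, color):
    """Label inner vertices: overlapping child colors -> duplication, else speciation."""
    cache, label = {}, {}
    for v, kids in children.items():
        sets = [leaf_colors(c, children, color, cache) for c in kids]
        overlap = any(a & b for a, b in combinations(sets, 2))
        label[v] = "duplication" if overlap else "speciation"
    return label


def informative_triples(children, color):
    """Triples (A, B, C), read as AB|C, rooted at speciation vertices."""
    cache = {}
    label = extremal_labeling(children, color)
    triples = set()
    for v, kids in children.items():
        if label[v] != "speciation":
            continue
        for c1 in kids:
            inside = leaf_colors(c1, children, color, cache)
            for c2 in kids:
                if c2 == c1:
                    continue
                outside = leaf_colors(c2, children, color, cache)
                # Two distinct colors below c1 and a third color below c2.
                for a, b in combinations(sorted(inside), 2):
                    for c in outside:
                        triples.add((a, b, c))
    return triples


children = {"r": ["u", "c1"], "u": ["a1", "b1"]}
color = {"a1": "A", "b1": "B", "c1": "C"}
print(extremal_labeling(children, color))
print(informative_triples(children, color))
```

Because the children of a speciation vertex are color-disjoint by construction, the third color of each triple automatically differs from the first two; colors are assumed to be sortable (e.g., strings).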
Nothing in our theory, however, ensures that $\mathcal{S}$ is a consistent set of triples, much less that $\mathcal{S}$ is consistent with a given species tree $S$.

Since constraints on reconciliation maps deriving from the species phylogeny are fully expressed by informative triples, no such constraint exists in particular for any vertex $u$ of $A(T^*)$ that has only leaves as children. That is, false-positive orthology assignments among the children of $u$ cannot be identified from the BMG alone because there are no further descendants to witness $u$ as a duplication event. Additional evidence, such as the assumption of a molecular clock or synteny, must be used to resolve situations such as the complementary loss shown in Fig. 1.

On the other hand, every gene tree $T$ can be reconciled with every species tree $S$ [13, 16, 37] at the expense of reassigning events as duplications. Clearly, if $A(T^*)$ is already binary, consistency will require the relabeling of some speciation nodes as duplications. Can one characterize and efficiently compute the minimal relabelings? In the general case, a further refinement of $A(T^*)$ may be sufficient. Is a refinement of speciation nodes sufficient, or are there in general speciation nodes in $(A(T^*), \hat{t})$ that need to be refined into separate speciation and duplication events?

Since orthology graphs are cographs contained in the RBMG $(G, \sigma)$, it is of interest to compare the deletion of all u-fp edges in $(G, \sigma)$ with finding a (minimal) edge-deletion set to obtain a cograph. These two problems are clearly distinct: The simplest example is the BMG $(\vec{G}, \sigma)$ in Fig. 6(A): its symmetric part $G$ is already a cograph but $(\vec{G}, \sigma)$ contains the hug-edge $xy$, which must be deleted.
Despite its practical use [23, 29], this observation relegates cograph-editing [20, 32, 49] to the status of a heuristic approximation for the purpose of orthology detection.

For practical applications, one has to keep in mind that best matches are inferred from sequence similarity data. Despite efforts to convert best (blast) hits into evolutionary best matches in a systematic manner [46], estimated BMGs will contain errors, which in most cases will violate the definition of best match graphs. This raises the question of how an empirical estimate of a BMG can be corrected to a closest “correct” BMG that (approximately) fits the data. The analogous RBMG-editing problem is NP-hard [21]. Complexity results for the BMG-editing problem, however, have not become available so far.

Orthology prediction tools intended for large data sets often do not attempt to infer the orthology graph, but instead are content with summarizing the information as clusters of orthologous groups (COGs) in an empirically estimated RBMG [39, 48]. Formally, this amounts to editing the BMG to a set of disjoint cliques. The example in Fig. 7 shows that this approach can destroy correct orthology information: the BMG $(\vec{G}, \sigma)$ does not contain u-fp edges and thus, it is the closest orthology graph. However, $(\vec{G}, \sigma)$ is not the disjoint union of cliques.

Acknowledgements
We thank Carsten R. Seemann for fruitful discussions and his helpful comments. This work was supported in part by the Austrian Federal Ministries BMK and BMDW and the Province of Upper Austria in the frame of the COMET Programme managed by FFG, and by the German Research Foundation (DFG, grant no. STA 850/49-1).
References

[1] Adrian M Altenhoff, Brigitte Boeckmann, Salvador Capella-Gutierrez, Daniel A Dalquen, Todd DeLuca, Kristoffer Forslund, Jaime Huerta-Cepas, Benjamin Linard, Cécile Pereira, Leszek P Pryszcz, Fabian Schreiber, Alan Sousa da Silva, Damian Szklarczyk, Clément-Marie Train, Peer Bork, Odile Lecompte, Christian von Mering, Ioannis Xenarios, Kimmen Sjölander, Lars Juhl Jensen, Maria J Martin, Matthieu Muffato, Quest for Orthologs consortium, Toni Gabaldón, Suzanna E Lewis, Paul D Thomas, Erik Sonnhammer, and Christophe Dessimoz. Standardized benchmarking in the quest for orthologs. Nature Methods, 13:425–430, 2016.
[2] Sebastian Böcker and Andreas W. M. Dress. Recovering symbolically dated, rooted trees from symbolic ultrametrics. Adv. Math., 138:105–125, 1998.
[3] W C Chang and O Eulenstein. Reconciling gene trees with apparent polytomies. In D Z Chen and D T Lee, editors, Computing and Combinatorics. COCOON 2006, volume 4112 of Lect. Notes Comp. Sci., pages 235–244, Berlin, Heidelberg, 2006. Springer.
[4] D G Corneil, H Lerchs, and L S Burlingham. Complement reducible graphs. Discr. Appl. Math., 3:163–174, 1981.
[5] R. DeSalle, R. Absher, and G. Amato. Speciation and phylogenetic resolution. Trends Ecol. Evol., 9:297–298, 1994.
[6] Christophe Dessimoz, Brigitte Boeckmann, Alexander C. J. Roth, and Gaston H. Gonnet. Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Res, 34:3309–3316, 2006.
[7] J-P Doyon, V Ranwez, V Daubin, and V Berry. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform., 12:392–400, 2011.
[8] W M Fitch. Distinguishing homologous from analogous proteins. Syst Zool, 19:99–113, 1970.
[9] Toni Gabaldón and Eugene V. Koonin. Functional and evolutionary implications of gene orthology. Nat Rev Genet., 14:360–366, 2013.
[10] M Y Galperin, D M Kristensen, K S Makarova, Y I Wolf, and E V Koonin. Microbial genome analysis: the COG approach. Brief Bioinform., 20:1063–1070, 2019.
[11] Michael R Garey and David S Johnson. Computers and intractability: A Guide to the Theory of NP-completeness. W. H. Freeman, San Francisco, 1979.
[12] Manuela Geiß, Edgar Chávez, Marcos González Laffitte, Alitzel López Sánchez, Bärbel M R Stadler, Dulce I. Valdivia, Marc Hellmuth, Maribel Hernández Rosales, and Peter F Stadler. Best match graphs. J. Math. Biol., 78:2015–2057, 2019.
[13] Manuela Geiß, Marcos E. González Laffitte, Alitzel López Sánchez, Dulce I. Valdivia, Marc Hellmuth, Maribel Hernández Rosales, and Peter F. Stadler. Best match graphs and reconciliation of gene trees with species trees. J. Math. Biol., 80:1459–1495, 2020.
[14] Manuela Geiß, Peter F. Stadler, and Marc Hellmuth. Reciprocal best match graphs. J. Math. Biol., 80:865–953, 2020.
[15] Paweł Górecki and Jerzy Tiuryn. DLS-trees: A model of evolutionary scenarios. Theor. Comp. Sci., 359:378–399, 2006.
[16] R Guigó, I Muchnik, and T F Smith. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol, 6:189–213, 1996.
[17] Oskar Hagen and Tanja Stadler. TreeSimGM: Simulating phylogenetic trees under general Bellman-Harris models with lineage-specific shifts of speciation and extinction in R. Methods Ecol. Evol., 9:754–760, 2018.
[18] Kousuke Hanada, Ayumi Tezuka, Masafumi Nozawa, Yutaka Suzuki, Sumio Sugano, Atsushi J Nagano, Motomi Ito, and Shin-Ichi Morinaga. Functional divergence of duplicate genes several million years after gene duplication in Arabidopsis. DNA Research, 25:327–339, 2018.
[19] Marc Hellmuth. Biologically feasible gene trees, reconciliation maps and informative triples. Alg Mol Biol, 12:23, 2017.
[20] Marc Hellmuth, Adrian Fritz, Nicolas Wieseke, and Peter F. Stadler. Techniques for the cograph editing problem: Module merge is equivalent to edit $P_4$’s. Art Discr. Appl. Math., 3:
[21] Theor. Comp. Sci., 809:384–393, 2020.
[22] Marc Hellmuth, Maribel Hernandez-Rosales, Katharina T. Huber, Vincent Moulton, Peter F. Stadler, and Nicolas Wieseke. Orthology relations, symbolic ultrametrics, and cographs. J. Math. Biol., 66:399–420, 2013.
[23] Marc Hellmuth, Nicolas Wieseke, Marcus Lechner, Hans-Peter Lenhof, Martin Middendorf, and Peter F. Stadler. Phylogenomics with paralogs. Proc Natl Acad Sci USA, 112:2058–2063, 2015.
[24] Maribel Hernandez-Rosales, Marc Hellmuth, Nick Wieseke, Katharina T. Huber, Vincent Moulton, and Peter F. Stadler. From event-labeled gene trees to species trees. BMC Bioinformatics, 13(Suppl. 19):S6, 2012.
[25] Stephanie Keller-Schmidt and Konstantin Klemm. A model of macroevolution as a branching process based on innovations. Adv. Complex Syst., 15:1250043, 2012.
[26] David G. Kendall. On the generalized birth-and-death process. Ann. Math. Statistics, 19:1–15, 1948.
[27] R M Kliman, P Andolfatto, J A Coyne, F Depaulis, M Kreitman, A J Berry, J McCarter, J Wakeley, and J Hey. The population genetics of the origin and divergence of the Drosophila simulans complex species. Genetics, 156:1913–1931, 2000.
[28] Manuel Lafond, Cedric Chauve, Riccardo Dondi, and Nadia El-Mabrouk. Polytomy refinement for the correction of dubious duplications in gene trees. Bioinformatics, 30:i519–i526, 2014.
[29] Manuel Lafond, Riccardo Dondi, and Nadia El-Mabrouk. The link between orthology relations and gene trees: A correction perspective. Algorithms Mol Biol., 11:4, 2016.
[30] Manuel Lafond and Nadia El-Mabrouk. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics, 15:S12, 2014.
[31] Daiqing Liao. Concerted evolution: Molecular mechanisms and biological implications. Am. J. Hum. Genet., 64:24–30, 1999.
[32] Yunlong Liu, Jianxin Wang, Jiong Guo, and Jianer Chen. Complexity and parameterized algorithms for cograph editing. Theor. Comp. Sci., 461:45–54, 2012.
[33] W. Maddison. Reconstructing character evolution on polytomous cladograms. Cladistics, 5:365–377, 1989.
[34] Terry A. McKee and F. R. McMorris. Topics in Intersection Graph Theory. Society for Industrial and Applied Mathematics, 1999.
[35] Bruno T. L. Nichio, Jeroniza Nunes Marchaukoski, and Roberto Tadeu Raittz. New tools in orthology analysis: A brief review of promising perspectives. Front Genet., 8:165, 2017.
[36] Nikolai Nøjgaard, Manuela Geiß, Daniel Merkle, Peter F. Stadler, Nicolas Wieseke, and Marc Hellmuth. Time-consistent reconciliation maps and forbidden time travel. Alg. Mol. Biol., 13:2, 2018.
[37] R D M Page and M A Charleston. Reconciled trees and incongruent gene and species trees. DIMACS Ser Discrete Mathematics and Theor Comput Sci, 37:57–70, 1997.
[38] Deng Pan and Liqing Zhang. Tandemly arrayed genes in vertebrate genomes. Comp Funct Genomics, 2008:545269, 2008.
[39] Alexander C J Roth, Gaston H Gonnet, and Christophe Dessimoz. Algorithm of OMA for large-scale orthology inference. BMC Bioinformatics, 9:518, 2008.
[40] L. Y. Rusin, E. Lyubetskaya, K. Y. Gorbunov, and V. Lyubetsky. Reconciliation of gene and species trees. BioMed Res Int., 2014:642089, 2014.
[41] Erfan Sayyari and Siavash Mirarab. Testing for polytomies in phylogenetic species trees using quartet frequencies. Genes, 9:132, 2018.
[42] C. Semple and M. Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford, UK, 2003.
[43] Charles Semple. Reconstructing minimal rooted trees. Discr. Appl. Math., 127:489–503, 2003.
[44] João C. Setubal and Peter F. Stadler. Gene phylogenies and orthologous groups. In João C. Setubal, Peter F. Stadler, and Jens Stoye, editors, Comparative Genomics, volume 1704, pages 1–28. Springer, Heidelberg, 2018.
[45] Patricia S. Soria, Kriston L. McGary, and Antonis Rokas. Functional divergence for every paralog. Mol Biol Evol, 31:984–992, 2014.
[46] Peter F Stadler, Manuela Geiß, David Schaller, Alitzel López, Marcos Gonzalez Laffitte, Dulce Valdivia, Marc Hellmuth, and Maribel Hernandez Rosales. From pairs of most similar sequences to phylogenetic best matches. Alg. Mol. Biol., 15:5, 2020.
[47] K Takahashi, Y Terai, M Nishida, and N Okada. Phylogenetic relationships and ancient incomplete lineage sorting among cichlid fishes in Lake Tanganyika as revealed by analysis of the insertion of retroposons. Mol Biol Evol, 18:2057–2066, 2001.
[48] Roman L. Tatusov, Eugene V. Koonin, and David J. Lipman. A genomic perspective on protein families. Science, 278:631–637, 1997.
[49] Dekel Tsur. Faster algorithms for cograph edge modification problems. Inf. Processing Let., 158:105946, 2020.
[50] B Vernot, M Stolzer, A Goldman, and D Durand. Reconciliation with non-binary species trees. J Comput Biol., 15:981–1006, 2008.
[51] Rémi Zallot, Katherine J. Harrison, Bryan Kolaczkowski, and Valérie de Crécy-Lagard. Functional annotations of paralogs: A blessing and a curse. Life, 6:39, 2016.