The hybrid number of a ploidy profile
KATHARINA T. HUBER AND LIAM J. MAHER
ABSTRACT. Polyploidization, whereby an organism inherits multiple copies of the genome of its parents, is an important evolutionary event that has been observed in plants and animals. One way to study such events is in terms of the ploidy level of the species that make up a dataset of interest. It is therefore natural to ask the following question, which we address here: How much information about the evolutionary past of a dataset can be gleaned from the ploidy levels of the species that make up the dataset? To help answer this question we introduce and study the novel concept of a ploidy profile, which allows us to formalize our question in terms of a multiplicity vector indexed by the species that make up the dataset. Using the framework of a phylogenetic network, our results include a closed formula for computing a ploidy profile's hybrid number (i.e. the minimal number of polyploidization events required to explain a polyploid dataset). Reflecting the fact that this formula relies on the availability of a prime factor decomposition of a number, we also present a non-trivial upper bound for it.
1. INTRODUCTION
Datasets such as the Sileni dataset considered in [23] arise when species inherit multiple sets of chromosomes from their parents. Generally referred to as polyploidization, this can be due to whole genome duplication, as in the case of e.g. watermelons and bananas [28], or due to obtaining an additional set of chromosomes via hybridization, as is the case for the frog genus Xenopus [20]. This poses the following intriguing question at the center of this paper, which is related to the well-studied problem of finding the hybrid number of a set of phylogenetic trees (i.e. leaf-labelled rooted trees without any vertices of indegree and outdegree one) [3]: How much information about the evolutionary past of a dataset can be gleaned from the ploidy number (i.e. the number of copies of a chromosome in a genome) of the species that make up the dataset? Invoking parsimony to capture the idea that polyploidization is a relatively rare evolutionary event, we make this question precise as follows: What is the minimum number of polyploidization events necessary to explain a dataset's observed ploidy profile, that is, the ploidy numbers of the species that make up the dataset?

As it turns out, an answer to this question is well-known if the ploidy profile in question is presented in terms of a multi-labelled tree (see e.g. [8, 12, 16, 17]). Since it is, however, not always clear how to derive such a tree from the dataset in the first place [10], we focus here on ploidy profiles for which such a tree is not necessarily available. More precisely, we view the ploidy profile of a dataset $X$ as a multiplicity vector $\vec{m} = (m_1, \ldots, m_n)$, $n = |X|$, indexed by the species in $X$ where, for each $1 \le i \le n$, the ploidy number of species $i \in X$ is $m_i$.

Date: January 27, 2021.
Key words and phrases. Phylogenetic network, hybrid number.

Due to the reticulate nature of the signal left behind by polyploidization [15, 19, 22], phylogenetic networks on some set $X$ of species (see [1, 6, 9, 14, 24, 25, 29] for methodology and construction algorithms surrounding phylogenetic networks) offer themselves as a natural framework to formalize and answer our question. Such networks are essentially binary rooted directed acyclic graphs whose leaf set is $X$ and whose hybrid vertices, that is, vertices with indegree two and outdegree one, represent reticulate evolutionary events such as polyploidization (see e.g. Figure 1 for an example of a phylogenetic network where $X = \{a, b, c, d\}$). Note that to be able to account for autopolyploidization, whereby a whole genome is duplicated, we also allow such graphs to have parallel arcs (see e.g. [7, 26] for further results concerning such networks).

By taking, for every leaf $x$ of a phylogenetic network $N$ on some finite set $X$, the number of directed paths from the root of $N$ to $x$, every phylogenetic network induces a multiplicity vector indexed by the elements in $X$ (see [21], where the assumption that the number of directed paths from every vertex of $N$ to every leaf of $N$ is known is formalized as an ancestral profile for $N$, and [5] for the usage of multiplicity vectors to define a metric for phylogenetic networks). Taken together, this suggests the following formalization of our question.
Given a ploidy profile $\vec{m}$ indexed by the elements of some finite set $X$, what can be said about the hybrid number $h(\vec{m})$ of $\vec{m}$, that is, the minimum number of hybrid vertices required by a binary phylogenetic network $N$ on $X$ to realize $\vec{m}$, meaning that the multiplicity vector induced by $N$ as described above is $\vec{m}$?

To illustrate this question, consider the ploidy profile $\vec{m} = (13, 6, 6, 5)$ indexed by $X = \{a, b, c, d\}$ where the multiplicity of $a$ is 13, that of $b$ and $c$ is 6, and that of $d$ is 5. Then it is straightforward to see that any binary phylogenetic network $N$ on $X$ that realizes $\vec{m}$ must contain a directed path from the root $\rho$ of $N$ to $a$ that contains at least four hybrid vertices to account for the multiplicity of $a$ (walking from $a$ towards $\rho$ and choosing at each hybrid vertex the parent with the larger number of directed paths from $\rho$, that number is at most halved at a hybrid vertex and unchanged at a tree vertex, and $2^3 < 13$). Since, when viewed as a set, the size of $\vec{m}$ is three, it follows that at least two more hybrid vertices are needed by $N$ to account for the multiplicities of $b$ (and, therefore, also of $c$) and $d$. Thus, to realize $\vec{m}$, at least six hybrid vertices are required by $N$. As is easy to check, the directed acyclic graph $N(\vec{m})$ depicted in Figure 1 is a phylogenetic network that realizes $\vec{m}$ and which postulates six hybrid vertices. In fact, this bound is sharp for $\vec{m}$. We shall see that, as a consequence of Theorem 5.1 and Corollary 6.4, this is not a coincidence.

FIGURE 1. One of potentially many phylogenetic networks that realize the ploidy profile $\vec{m} = (13, 6, 6, 5)$ on $X = \{a, b, c, d\}$. To improve clarity of exposition, we always assume that arcs are directed away from the root.

The outline of the paper is as follows. After presenting some relevant basic terminology and notation concerning phylogenetic networks in the next section, we study in Section 3 structural properties of phylogenetic networks that attain the hybrid number of a ploidy profile. As part of this, we present a construction of a phylogenetic network $D(\vec{m})$ that attains that number for special types of ploidy profiles $\vec{m}$ (Theorem 3.4).

In Section 4, we introduce the notion of a simplification sequence $\sigma(\vec{m})$ of a ploidy profile $\vec{m}$ and present some basic results concerning such sequences, including an infinite family of ploidy profiles that shows that such a sequence can grow exponentially large. Denoting the last element of such a sequence for $\vec{m}$ by $\vec{m}'$, we then employ a traceback through $\sigma(\vec{m})$, starting at $\vec{m}'$, to obtain the network $N(\vec{m})$ from $D(\vec{m}')$. After collecting some preliminary results for $N(\vec{m})$ in that section, we establish in Section 5 that, in a well-defined sense, $N(\vec{m})$ is optimal (Theorem 5.1).

Since the construction of $D(\vec{m}')$ relies on being able to find the prime factor decomposition of a number, which is not guaranteed, we study in Section 6 an approximation of $D(\vec{m})$ in terms of a network $B(\vec{m})$, the construction of which exploits a binary representation of a number (Theorem 6.2). As a consequence, we obtain an upper bound on the hybrid number $h(\vec{m})$ of $\vec{m}$ (Corollary 6.3). We complement this by presenting sharp bounds for $h(\vec{m})$ for special cases of $\vec{m}$ (Proposition 6.4).

2. PRELIMINARIES
We start with introducing basic concepts surrounding phylogenetic networks and then briefly describe two basic operations concerning phylogenetic networks. For the convenience of the reader, we illustrate them in Figures 1 and 2 by means of an example. Throughout the paper, we assume that $X$ is a non-empty finite set, and we denote the size of $X$ by $n$.

2.1. Basic concepts. Suppose for the following that $G$ is a rooted directed connected acyclic graph which might contain parallel arcs but no loops. Then we denote the vertex set of $G$ by $V(G)$ and its set of arcs by $A(G)$. We denote an arc $a \in A(G)$ starting at a vertex $u$ and ending in a vertex $v$ by $(u, v)$ and refer to $u$ as the tail of $a$ and to $v$ as the head of $a$.

Suppose $v \in V(G)$. Then we refer to the number of arcs coming into $v$ as the indegree of $v$, denoted by $\mathrm{indeg}_G(v)$, and to the number of outgoing arcs of $v$ as the outdegree of $v$, denoted by $\mathrm{outdeg}_G(v)$. If $G$ is clear from the context then we will omit the subscript in $\mathrm{indeg}_G(v)$ and $\mathrm{outdeg}_G(v)$, respectively. We call $v$ the root of $G$, denoted by $\rho_G$, if $\mathrm{indeg}(v) = 0$, and we call $v$ a leaf of $G$ if $\mathrm{indeg}(v) = 1$ and $\mathrm{outdeg}(v) = 0$. We denote the set of leaves of $G$ by $L(G)$. We call $v$ a tree vertex if $\mathrm{outdeg}(v) = 2$ and $\mathrm{indeg}(v) = 1$. And we call $v$ a hybrid vertex if $\mathrm{indeg}(v) \ge 2$ and $\mathrm{outdeg}(v) = 1$. We denote the set of hybrid vertices of $G$ by $H(G)$. We say that $G$ is binary if $\mathrm{outdeg}(\rho_G) = 2$ and, for all $v \in V(G) - L(G)$ other than $\rho_G$, the degree sum $\mathrm{indeg}(v) + \mathrm{outdeg}(v)$ is three. We say that a vertex $w \in V(G)$ is above $v$ if there exists a directed path $P$ from $w$ to $v$. In that case, we also say that $v$ is below $w$. If $P$ has at least two vertices then we say that $w$ is strictly above $v$ and that $v$ is strictly below $w$. We refer to a vertex $w \in V(G)$ as an ancestor of $v$ if $w$ is above $v$. Note that a vertex can be its own ancestor.

We call $G$ a phylogenetic network (on $X$) if $L(G) = X$, every vertex $v \in V(G) - L(G)$ other than $\rho_G$ is a tree vertex or a hybrid vertex and, unless $n = 1$, $\mathrm{outdeg}(\rho_G) \ge 2$. Note that phylogenetic networks in our sense were called semi-resolved phylogenetic networks in [8] and that our definition of a phylogenetic network differs from the standard definition of such an object (see e.g. [25]) by allowing parallel arcs. Suppose $G$ is a phylogenetic network on $X$. Then, following [3], we define the hybrid number $h(G)$ of $G$ to be
$$h(G) = \sum_{h \in H(G)} (\mathrm{indeg}(h) - 1).$$
We refer to a phylogenetic network $G$ (on $X$) as a phylogenetic tree (on $X$) if $h(G) = 0$.

Suppose that $T$ is a phylogenetic tree on $X$ and that $Y \subseteq X$ is a non-empty subset. Then we refer to the unique vertex $v$ in $T$ that is an ancestor of every element in $Y$ such that there is no vertex (strictly) below $v$ that is also an ancestor of every element in $Y$ as the last common ancestor of $Y$, denoted by $\mathrm{lca}_T(Y)$. If $Y = \{y_1, \ldots, y_r\}$, some $1 \le r \le |X|$, we also write $\mathrm{lca}(y_1, \ldots, y_r)$ rather than $\mathrm{lca}(Y)$. We denote for any non-root vertex $v \in V(T)$ the subtree of $T$ obtained by deleting the incoming arc of $v$ that contains $v$ by $T(v)$.

Suppose that $N$ is a phylogenetic network on $X$. Then we denote the number of directed paths from the root $\rho_N$ of $N$ to a leaf $x$ of $N$ by $m_N(x)$. In case $N$ is clear from the context, we will write $m(x)$ rather than $m_N(x)$. For $N'$ a further phylogenetic network on $X$, we say that $N$ and $N'$ are equivalent if there exists a graph isomorphism between $N$ and $N'$ that is the identity on $X$. Furthermore, we say that $N'$ is a resolution of $N$ if $N'$ is obtained from $N$ by resolving all vertices in $H(N)$ so that every vertex in $H(N')$ has indegree two. Note that for any resolution $N'$ of $N$, we have $h(N) = |H(N')|$.

2.2. Two basic constructions.
Phylogenetic trees on $X$ were generalized in [8] to so-called multi-labelled trees (on $X$), or MUL-trees (on $X$) for short, by replacing the leaf set of a phylogenetic tree by a multiset $Y$ on $X$, that is, $X$ is the set obtained from $Y$ by ignoring multiplicities. As was pointed out in the same paper, every phylogenetic network $N$ gives rise to a multi-labelled tree $U(N)$ on $X$ by recording, for every vertex $v$ of $N$, every directed path from the root $\rho_N$ of $N$ to $v$. More precisely, the vertex set of $U(N)$ is the set of all directed paths from $\rho_N$ to any vertex of $N$, and two vertices $P$ and $P'$ in $U(N)$ are joined by an arc $(P', P)$ if there exists an arc $a \in A(N)$ such that $P$ is obtained from $P'$ by extending $P'$ by the arc $a$.

Intriguingly, the above construction can also be reversed, as we outline briefly next (see [8] for details and [11, 13] for more on both constructions). We start with some terminology. Suppose that $M$ is a multi-labelled tree on $X$. Then there must exist at least two distinct vertices $u, v \in V(M)$ such that the induced MUL-trees $T(u)$ and $T(v)$ of $M$ rooted at $u$ and $v$, respectively, are equivalent copies of each other (where we canonically extend the notion of the subtree $T(v)$ induced by a vertex $v$ in a phylogenetic tree to that of a subMUL-tree induced by a vertex in a MUL-tree). Assume that $u$ and $v$ form an identifiable pair, that is, there exist no vertices $v' \in V(M)$ strictly above $v$ and $u' \in V(M)$ strictly above $u$ such that the subMUL-trees of $M$ induced by $u'$ and $v'$ are equivalent copies of each other. Then we store one of the equivalent copies of $T(u)$ in a sequence $\gamma_M$ of subMUL-trees of $M$ which we call a fold-up sequence for $M$. That sequence is initialized with the empty sequence at the beginning of the construction of the phylogenetic network from $M$.

Let $V \subseteq V(M)$ denote the set of vertices $w \in V(M)$ such that the subMUL-tree induced by $w$ is an equivalent copy of $T(u)$. Then, for all $w \in V$, we first subdivide the incoming arc of $w$ by a vertex $h_w$ and then identify all vertices $h_w$, $w \in V$. Clearly, the resulting vertex, call it $h_v$, has $|V|$ outgoing arcs. From these $|V|$ outgoing arcs of $h_v$ we delete all but one arc and, for each deleted arc $a$, remove the subMUL-tree $T(w')$ rooted at the head $w'$ of $a$. Denoting the remaining subMUL-tree rooted at $h_v$ by $T(h_v)$, we then grow $\gamma_M$ by adding an equivalent copy of $T(h_v)$ at the end of $\gamma_M$ in case $\gamma_M$ is not the empty sequence. Otherwise we add $T(h_v)$ as the first element to $\gamma_M$. Replacing $M$ with the resulting graph $N_V$, we then find a new identifiable pair of vertices and proceed as before (where we canonically extend the notions of an identifiable pair and of a subMUL-tree rooted at a vertex to $N_V$).

Clearly, the process of subdividing, identifying, and deleting terminates in a phylogenetic network $F(M)$ on $X$. As was established in [8], $F(M)$ is independent of the order in which identifiable pairs are processed. Note that all tree vertices of $F(M)$ have outdegree two because $M$ is a binary MUL-tree. However, $F(M)$ might contain hybrid vertices whose indegree is two or more.

To illustrate both constructions, we picture for the ploidy profile $\vec{m} = (13, 6, 6, 5)$ in Figure 2 the multi-labelled tree $U(N(\vec{m}))$ associated to the phylogenetic network $N(\vec{m})$ depicted in Figure 1. That network is a resolution of $F(U(N(\vec{m})))$.

FIGURE 2. (i) The multi-labelled tree $U = U(N(\vec{m}))$ for the ploidy profile $\vec{m} = (13, 6, 6, 5)$ considered in Figure 1. (ii) The first step in the construction of $F(U)$ from $U$ where $(u, v)$ is the pair of identifiable vertices indicated in (i).

3. PROPERTIES OF PHYLOGENETIC NETWORKS THAT ATTAIN THE HYBRID NUMBER OF A PLOIDY PROFILE
In this section, we collect structural properties of phylogenetic networks that attain the hybrid number of a ploidy profile $\vec{m}$. Calling a ploidy profile $\vec{m} = (m_1, \ldots, m_n)$ simple if $m_i = 1$ for all $2 \le i \le n$ and strictly simple if $|X| = 1$, we also introduce a construction of a phylogenetic network $D(\vec{m})$ which is guaranteed to attain that number in case $\vec{m}$ is simple.
We start with some notations and definitions. Suppose $N$ is a binary phylogenetic network on $X = \{x_1, \ldots, x_n\}$. Then we say that $N$ realizes a ploidy profile $\vec{m} = (m_1, \ldots, m_n)$ on $X$ if $m_i \ge 1$, for all $1 \le i \le n$, and the elements in $X$ can be ordered in such a way that $m_i = m(x_i)$ holds for all $1 \le i \le n$. We say that $N$ attains $h(\vec{m})$ if $N$ realizes $\vec{m}$ and $h(\vec{m}) = h(N)$. In that case, we refer to $N$ as an attainment of $\vec{m}$.

For ease of readability we will assume from now on that for a ploidy profile $\vec{m} = (m_1, \ldots, m_n)$ on $X$ the elements in $X$ are always ordered in such a way that $m(x_i) = m_i$ holds for all $1 \le i \le n$ and that $\vec{m}$ is in descending order, that is, $m_i \ge m_{i+1}$ holds for all $1 \le i \le n - 1$.

Lemma 3.1. For any ploidy profile $\vec{m} = (m_1, \ldots, m_n)$ on $X$ we have $0 \le h(\vec{m}) \le \sum_{i=1}^{n} (m_i - 1)$.

Proof. Suppose $\vec{m} = (m_1, \ldots, m_n)$ is a ploidy profile on $X$. Then $0 \le h(\vec{m})$ clearly holds. We prove the stated upper bound by providing a construction of a MUL-tree $M(\vec{m})$ from $\vec{m}$ such that $h(F(M(\vec{m}))) = \sum_{i=1}^{n} (m_i - 1)$. For the convenience of the reader we illustrate this construction in Figure 3 in terms of a ploidy profile $\vec{m} = (m_1, \ldots, m_n)$ on $X = \{x_1, \ldots, x_n\}$ where, for all $1 \le i \le n$, $C_i$ denotes the caterpillar MUL-tree on $\{x_i\}$ (i.e. $C_i$ has a unique cherry) with $m_i$ leaves.

FIGURE 3. (i) The MUL-tree $M(\vec{m})$ on $X = \{x_1, \ldots, x_n\}$ for the ploidy profile $\vec{m} = (m_1, \ldots, m_n)$ on $X$. For all $1 \le i \le n$ the caterpillar MUL-tree $C_i$ on $\{x_i\}$ is indicated by $C_i$. (ii) The phylogenetic network $F(M(\vec{m}))$ associated to $\vec{m}$.

For all $1 \le i \le n$ we first construct the caterpillar MUL-tree $C_i$ on $\{x_i\}$ with $m_i$ leaves. With $\rho_{C_i}$ denoting the root of $C_i$, we then add an arc $(\rho, \rho_{C_i})$ from a new vertex $\rho$ not already contained in $\bigcup_{i=1}^{n} V(C_i)$ to obtain a MUL-tree on $X$. If the outdegree of $\rho$ is two then that tree is $M(\vec{m})$. Otherwise, we arbitrarily resolve $\rho$ to obtain a binary MUL-tree on $X$ and that tree is $M = M(\vec{m})$. Note that $h(F(M))$ is independent of the chosen resolution of $\rho$. By construction, any resolution of $F(M(\vec{m}))$ clearly realizes $\vec{m}$. $\square$
As we shall see in Proposition 6.4, the bound in Lemma 3.1 can be improved for many ploidy profiles.
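To make the construction in the proof of Lemma 3.1 concrete, the following Python sketch builds a network of the shape described there and checks its multiplicity vector and hybrid number. It is an illustration under assumptions of ours rather than part of the formal development: we read the fold-up of the caterpillar $C_i$ as identifying all $m_i$ leaves labelled $x_i$ into one hybrid vertex of indegree $m_i$, we leave the new root unresolved (which changes neither quantity), and all function and vertex names are hypothetical.

```python
from collections import defaultdict, deque


def folded_caterpillar_arcs(profile):
    """Arc list of a network realizing `profile` in the spirit of Lemma 3.1.

    Assumption: for each i, the m_i leaves of the caterpillar C_i are folded
    into one hybrid vertex h_i of indegree m_i. Repeated pairs in the list
    model parallel arcs. The root is left unresolved.
    """
    arcs = []
    for i, m in enumerate(profile, start=1):
        leaf = f"x{i}"
        if m == 1:
            arcs.append(("root", leaf))
            continue
        hyb = f"h{i}"
        spine = [f"s{i}_{j}" for j in range(1, m)]  # m - 1 spine vertices
        arcs.append(("root", spine[0]))
        for upper, lower in zip(spine, spine[1:]):
            arcs.append((upper, lower))
            arcs.append((upper, hyb))
        # The cherry at the bottom of the caterpillar becomes two parallel arcs.
        arcs.extend([(spine[-1], hyb), (spine[-1], hyb)])
        arcs.append((hyb, leaf))
    return arcs


def leaf_path_counts(arcs, root="root"):
    """Number of directed root-to-leaf paths, m_N(x), for every leaf x."""
    out, pending = defaultdict(list), defaultdict(int)
    for tail, head in arcs:
        out[tail].append(head)
        pending[head] += 1
    paths = defaultdict(int)
    paths[root] = 1
    queue, leaves = deque([root]), {}
    while queue:  # process vertices in topological order
        v = queue.popleft()
        children = out.get(v, [])
        if not children:
            leaves[v] = paths[v]
        for w in children:
            paths[w] += paths[v]
            pending[w] -= 1
            if pending[w] == 0:
                queue.append(w)
    return leaves


def hybrid_number(arcs):
    """h(G): sum of (indeg(h) - 1) over all vertices h with indegree >= 2."""
    indeg = defaultdict(int)
    for _, head in arcs:
        indeg[head] += 1
    return sum(d - 1 for d in indeg.values() if d >= 2)


profile = (13, 6, 6, 5)
arcs = folded_caterpillar_arcs(profile)
print(leaf_path_counts(arcs))  # multiplicities of x1, ..., x4
print(hybrid_number(arcs))     # equals sum(m_i - 1)
```

For the profile $(13, 6, 6, 5)$ from the introduction this yields $12 + 5 + 5 + 4 = 26$, compared with the six hybrid vertices postulated by the network in Figure 1.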
Lemma 3.2.
Suppose (cid:126) m = ( m , . . . , m n ) is a ploidy profile on X = { x , . . . , x n } and that Nis an attainment of (cid:126) m. Then there must exist a directed path P from ρ N to x such that everyvertex in H ( N ) lies on P.Proof. Assume for contradiction that (cid:126) m is a ploidy profile and that N is a phylogeneticnetwork that attains h ( (cid:126) m ) but that no directed path from ρ N to x contains all vertices of H ( N ) . Without loss of generality, we may assume for all 1 ≤ i ≤ n − m i > m i + .Clearly, h ( N ) ≥ H ( N ) must be above x .We start with useful structural assumptions that we can make on N . Let P denote adirected path from the root ρ N of N to x that contains a vertex h ∈ H ( N ) . Then, byassumption, there must exist a vertex h ∈ H ( N ) that is not contained in P . For i = , v on P such that there exists a directed path P i from v to h i such that P and P only have v as a common vertex. Without loss of generality we mayassume that, over all directed paths from ρ N to x , the vertex v is as close as possible to ρ N ,that is, there exists no vertex v (cid:48) strictly above v that enjoys the same properties as v withregards to the analogously defined directed paths P (cid:48) an P (cid:48) starting at v (cid:48) . Without loss ofgenerality we may assume that P is such that the induced directed subpath P (cid:48) from v to x via h contains as many hybrid vertices as possible. Furthermore we may assume withoutloss of generality that h i is such that every vertex other than h i on the induced subpath from v to h i is a tree vertex. 
Combined with a case analysis concerning the structure of a directedpath from v to h i , i ∈ { , } , it follows that we may also assume without loss of generalityfor all i ∈ { , } that h i is such that there exists no directed path from v to h i that containsa vertex in H ( N ) − { h i } .Note that the assumption on P (cid:48) implies that there cannot exist a directed path from h to h as such a path would contain at least one more hybrid vertex. Furthermore, for all i ∈ { , } , the assumption on v and the choice of h i implies that there cannot exist a hybridvertex in H ( N ) that is strictly above h i but below v . Thus, for all i ∈ { , } , every vertex w of N below v and strictly above h i must be a tree vertex of N . Consequently, the number ofdirected path from ρ N to v equals the number of directed paths from ρ N to w . Finally, forall i ∈ { , } , the assumption that v is as close as possible to the root ρ N of N implies forall i ∈ { , } that every directed path from ρ N to h i must contain v . With m (cid:48) denoting thenumber of directed paths from ρ N to v the above assumptions imply for all i ∈ { , } thatthe number of directed paths from ρ N to h i is 2 m (cid:48) .Consider the phylogenetic network N (cid:48) obtained from N as follows. First subdivide theoutgoing arc of h by a vertex u and delete the outgoing arc a of h . Next, identify h and h and add an arc from u to the head of a . Finally for all i ∈ { , } merge the directed pathsfrom v to h i in such a way that if w and w (cid:48) are distinct vertices on the paths to be mergedbut w , w (cid:48) (cid:54)∈ { v , h , h } then if w is above w (cid:48) before the merger then w is not below w (cid:48) afterthe merger. Then, by construction, N (cid:48) is binary and realizes (cid:126) m . Since h ( N (cid:48) ) = h ( N ) − N attains h ( (cid:126) m ) . (cid:3) KATHARINA T. HUBER AND LIAM J. MAHER
To state the next lemma, we call an arc $a$ in a directed graph $G$ a cut-arc of $G$ if the deletion of $a$ disconnects $G$. We call a cut-arc $a$ of $G$ trivial if the head of $a$ is a leaf.
Lemma 3.3.
Suppose (cid:126) m = ( m , . . . , m n ) is a simple ploidy profile on X such that m is aprime number. Then any cut-arc in an attainment of (cid:126) m must be trivial.Proof. Suppose N is an attainment of (cid:126) m . Then we may assume without loss of generalitythat (cid:126) m is strictly simple since, for all 2 ≤ i ≤ n , the leaf in N indexing m i is the head ofa trivial cut-arc of N . Put X = { x } and m = m . Then the lemma clearly holds in case m ∈ { , } . So assume that m ≥ N has a non-trivial cut-arc a . Without loss of generalitywe may assume that a is as close to the root ρ N of N as possible. Let N and N denote therooted directed graphs obtained from N by deleting a . Without loss of generality we mayassume that N contains the root ρ N of N . Then N is clearly a phylogenetic network on X . Furthermore N can be turned into a phylogenetic network N (cid:48) on some set Y = { y } byadding a leaf y and an arc from the tail of a to y . As is easy to see, m = m N (cid:48) ( y ) × m N ( x ) .Since 1 (cid:54)∈ { m N (cid:48) ( y ) , m N ( x ) } and m is prime this is impossible. (cid:3) To be able to state the main result of this section (Theorem 3.4) we require furtherterminology. Suppose that m is a positive integer and that, for all 1 ≤ i ≤ k , p i is a primeand α i ≥ p α p α · . . . · p α k k is a prime factor decomposition of m .Without loss of generality, we may assume that the primes are indexed in such a way that,for all 1 ≤ i ≤ k − p i < p i + .For all 1 ≤ i ≤ k , let N i = N p i denote a binary phylogenetic network with a single leafon Y = { x } that attains h ( (cid:126) p i ) for a strictly simple ploidy profile (cid:126) p i = ( p i ) on Y . Let N α i i denote the phylogenetic network on Y obtained by taking the root ρ i of N i to be the root of N α i i . If α i = N α i i to be N i . If α i ≥ α i equivalent copies of N i and order them in some way. 
Next, we identify the unique leaf of the first of the $\alpha_i$ copies of $N_i$ with the root of the second copy of $N_i$ and so on until we have processed all $\alpha_i$ copies of $N_i$ this way.

Finally, we associate a rooted directed acyclic graph $D(\vec{m})$ with leaf set $X$ to a simple ploidy profile $\vec{m} = (m_1, m_2, \ldots, m_n)$ on $X$ as follows. Let $p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_k^{\alpha_k}$ denote a prime factor decomposition of $m = m_1$ as described above. Then the root $\rho$ of $D(\vec{m})$ is the root of $N_1^{\alpha_1}$. If $k = 1$ then we define $D(\vec{m})$ to be $N_1^{\alpha_1}$. Otherwise we define $D(\vec{m})$ to be the rooted directed graph obtained by identifying, for all $1 \le i \le k - 1$, the unique leaf of $N_i^{\alpha_i}$ with the root of $N_{i+1}^{\alpha_{i+1}}$. Clearly, $D(\vec{m})$ is a phylogenetic network on $\{x_1\}$.

Unless $\vec{m}$ is strictly simple, in either case, we process $D(\vec{m})$ further by subdividing the two outgoing arcs of $\rho$ by $n - 1$ vertices $s_2, \ldots, s_n$ (ensuring that each of the two outgoing arcs of $\rho$ is subdivided at least once in case $n > 2$). Then, for all $2 \le i \le n$, we add the arcs $(s_i, x_i)$ to $D(\vec{m})$ to obtain a phylogenetic network on $X$.
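The only arithmetic input to the construction of $D(\vec{m})$ is the prime factor decomposition of $m_1$ with the primes in increasing order. The Python sketch below computes it by trial division and lists the order in which the copies of the blocks $N_{p_i}$ are chained. Trial division and the function names are our own choices for illustration; the text does not prescribe a factorization method, and the sketch does not build the blocks $N_{p_i}$ themselves. Since identifying the leaf of one block with the root of the next multiplies the numbers of root-to-leaf paths, the product of the listed primes must equal $m_1$.

```python
import math


def prime_factor_decomposition(m):
    """Return [(p_1, a_1), ..., (p_k, a_k)] with p_1 < ... < p_k and
    m = p_1**a_1 * ... * p_k**a_k, computed by trial division."""
    if m < 2:
        raise ValueError("m must be an integer >= 2")
    factors = []
    p = 2
    while p * p <= m:
        if m % p == 0:
            exponent = 0
            while m % p == 0:
                m //= p
                exponent += 1
            factors.append((p, exponent))
        p += 1
    if m > 1:  # whatever remains is a prime larger than sqrt of the input
        factors.append((m, 1))
    return factors


def chain_order(m):
    """Primes in the order their blocks are chained from the root down:
    a_1 copies for p_1, then a_2 copies for p_2, and so on."""
    return [p for p, a in prime_factor_decomposition(m) for _ in range(a)]


print(prime_factor_decomposition(360))
print(chain_order(12))
print(math.prod(chain_order(360)) == 360)
```

For a prime $m_1$, such as $m_1 = 13$, the chain consists of a single block, so that $D(\vec{m})$ reduces to an attainment of $h((m_1))$ with the remaining leaves attached near the root.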
For the convenience of the reader, we depict for the simple ploidy profile (cid:126) m = ( , , ) on X = { x , x , x } the network D ( (cid:126) m ) on X in Figure 4.F IGURE
4. For the simple ploidy profile (cid:126) m = ( , , ) on X = { x , x , x } , we depict the phylogenetic network D ( (cid:126) m ) on X which,in this case, is equivalent with the network B ( (cid:126) m ) introduced in Sec-tion 6. Viewing (cid:126) m as the last element in σ ( (cid:126) m ) for the ploidy profile (cid:126) m = ( , , , ) on X = { a , b , c , d } implies that the elements in X arecertain phylogenetic trees on subsets of X . Theorem 3.4.
Suppose (cid:126) m = ( m , . . . , m n ) is a simple ploidy profile on X such that m ≥ .Then D ( (cid:126) m ) is an attainment of (cid:126) m.Proof. Note first that, by construction, D ( (cid:126) m ) is a binary phylogenetic network on X thatrealizes (cid:126) m . Assume for contradiction that there exists some simple ploidy profile (cid:126) m =( m , . . . , m n ) on X such that D ( (cid:126) m ) does not attain h ( (cid:126) m ) . Let Q denote a phylogeneticnetwork on X that attains h ( (cid:126) m ) . Note that we may assume without loss of generality that (cid:126) m is strictly simple. Hence, (cid:126) m = ( m ) where m = m and X = { x } . As is easy to see, m ≥ m is as small as possible suchthat D ( (cid:126) m ) does not attain h ( (cid:126) m ) . For 1 ≤ i ≤ k let p i be a prime and α i a positive integersuch that p α p α · . . . · p α k k is a prime factor decomposition of m . Without loss of generalitywe may assume again that, for all 1 ≤ i ≤ k −
1, we have that p i < p i + .Let m (cid:48) = p α − p α · . . . · p α k k . Then, by the choice of m , it follows that h ( (cid:126) m (cid:48) ) is attainedby D ( (cid:126) m (cid:48) ) for the strictly simple ploidy profile (cid:126) m (cid:48) = ( m (cid:48) ) . Let γ Q = ( T , T , . . . , T l ) , some l ≥
1, denote a fold-up sequence for U ( Q ) . Note that, by the minimality of Q we have that Q is a resolution of F ( U ( Q )) . In view of Lemma 3.2, it follows for all 1 ≤ i ≤ l − T i + is a subMULtree of T i .To obtain the required contradiction we next modify U ( Q ) into a new MUL-tree M . Tothis end, we distinguish the cases that m is even and that m is odd.Assume first that m is even. Then there exists some positive integer k such that m = k .Since Q attains h ( (cid:126) m ) it follows that the number of leaves below each of the two children c and c of the root ρ U ( Q ) must be k . For i = ,
2, let R i denote the subMULtree of U ( Q ) rooted at c i . Note that R and R must be equivalent. Hence, h ( F ( R )) = h ( F ( R )) . TheMUL-tree obtained by replacing both R and R by U = U ( D ( (cid:126) k )) where (cid:126) k is the strictlysimple ploidy profile ( k ) is M in this case. Note that since D ( (cid:126) k ) is a resolution of F ( U ) the choice of m implies that D ( (cid:126) k ) attains h ( (cid:126) k ) . By the choice of m and Q it follows that h ( Q ) < h ( D ( (cid:126) m )) = + h ( D ( (cid:126) k )) ≤ + h ( F ( R )) < h ( Q ) which is impossible. Thus, D ( (cid:126) m ) attains h ( (cid:126) m ) in this case.Assume for the remainder that m is odd. Then there exists some positive integer k suchthat m = k +
1. Hence, there must exist a leaf l of U ( Q ) that is not contained in a cherry.Let U (cid:48) denote the MUL-tree obtained from U ( Q ) by deleting l and its incoming arc. Ifthe parent p of l is not the root ρ U ( Q ) of U ( Q ) then we also suppress p . If p = ρ U ( Q ) thenwe collapse the unique outgoing arc of p . Let (cid:126) k denote the strictly simple ploidy profile ( k ) and let (cid:126) k + ( k + ) . By the choice of m it follows that h ( D ( (cid:126) k )) ≤ h ( F ( U (cid:48) )) = h ( N (cid:48) ) for any resolution N (cid:48) of F ( U (cid:48) ) . Combinedwith the choice of Q we obtain h ( Q ) < h ( D ( (cid:126) k + )) = h ( D ( (cid:126) k )) + ≤ h ( F ( U (cid:48) )) + = h ( Q ) where the last equality follows since at least one hybrid vertex has to be added to aresolution of F ( U (cid:48) ) so that the resulting phylogenetic network N (cid:48)(cid:48) realizes (cid:126) m . The MUL-tree U ( N (cid:48)(cid:48) ) is M in this case. But this chain of inequalities is impossible. Thus, D ( (cid:126) m ) mustalso attain h ( (cid:126) m ) in this case. (cid:3) Combined with Lemma 3.2, Theorem 3.4 implies the next result where we say that aphylogenetic network N is semi-stable if N is equivalent with a resolution of F ( U ( N )) .Note that phylogenetic networks N that is equivalent with F ( U ( N )) are called stable [11]. Corollary 3.5.
Suppose (cid:126) m = ( m , . . . , m n ) is a simple ploidy profile on X such that m ≥ .Then D ( (cid:126) m ) is semi-stable.
4. THE NETWORK $N(\vec{m})$ ASSOCIATED TO A PLOIDY PROFILE $\vec{m}$

To help establish a formula for computing the hybrid number of a ploidy profile $\vec{m}$, we start by associating to a ploidy profile $\vec{m}$ on $X$ a binary phylogenetic network $N(\vec{m})$ on $X$ that realizes $\vec{m}$. This network is recursively defined via a two-phase process which we outline next and describe in detail below.

Suppose $\vec{m} = (m_1, \ldots, m_n)$ is a ploidy profile on $X$. Then, in the first phase, we recursively generate a simple ploidy profile from $\vec{m}$. This process is captured by constructing a certain sequence of ploidy profiles for $\vec{m}$ whose first element is $\vec{m}$ and which terminates in that simple ploidy profile. As we shall see, such a sequence must necessarily be unique. To reflect this, we call it a simplification sequence (for $\vec{m}$) and denote it by $\sigma(\vec{m})$. In the second phase we then use a traceback through $\sigma(\vec{m})$ to construct $N(\vec{m})$ from the network $D$ associated to the last element of $\sigma(\vec{m})$. For the convenience of the reader we depict the phylogenetic networks associated to the first and last element of the simplification sequence for $\vec{m} = (13, 6, 6, 5)$ in Figures 1 and 4, respectively.

Phase I: Construction of a simplification sequence $\sigma(\vec{m})$ for $\vec{m}$. Assume that all ploidy profiles in $\sigma(\vec{m})$ have already been constructed up to and including a ploidy profile $\vec{m}'$. If $\vec{m}'$ is simple then $\vec{m}'$ is the last element of $\sigma(\vec{m})$.
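Phase I can be sketched in a few lines of Python. The sketch rests on our reading of the case distinction on $\alpha = m_1 - m_2$ used in this phase: if $\alpha = 0$ the first entry is deleted, and if $\alpha > 0$ the first entry is deleted and $\alpha$ is inserted so that the vector stays in descending order (which amounts to replacing $m_1$ by $\alpha$ when $\alpha > m_2$). The function names are hypothetical, and the relabelling of the index set is not tracked.

```python
def is_simple(profile):
    """A profile in descending order is simple if all entries after the
    first equal 1 (this includes profiles with a single entry)."""
    return len(profile) == 1 or profile[1] == 1


def successor(profile):
    """One simplification step for a non-simple profile in descending order."""
    alpha = profile[0] - profile[1]
    rest = list(profile[1:])  # m_1 is always removed
    if alpha > 0:
        # Insert alpha behind all entries that are >= alpha.
        pos = next((i for i, v in enumerate(rest) if v < alpha), len(rest))
        rest.insert(pos, alpha)
    return tuple(rest)


def simplification_sequence(profile):
    """All profiles visited, from the input down to a simple profile."""
    current = tuple(sorted(profile, reverse=True))
    if any(v < 1 for v in current):
        raise ValueError("all ploidy numbers must be at least 1")
    sequence = [current]
    while not is_simple(current):
        current = successor(current)
        sequence.append(current)
    return sequence


for step in simplification_sequence((13, 6, 6, 5)):
    print(step)
```

Each step either shortens the vector or lowers the sum of its entries, so the loop terminates. Under this reading, the sequence for $(13, 6, 6, 5)$ ends in a simple ploidy profile with three entries, in line with the three-leaf network referred to as Figure 4.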
Assume for the remainder that (cid:126) m (cid:48) is not simple. Let (cid:126) m = ( m , . . . , m q ) , some q ≥ Y = { y , . . . , y q } where q ∈ { n − , n } that is thesuccessor of (cid:126) m (cid:48) in σ ( (cid:126) m ) . Without loss of generality, we may assume that (cid:126) m = (cid:126) m (cid:48) . Notethat m (cid:54) =
1. Put α = m − m . To obtain (cid:126) m , we distinguish the following cases:If α = m (and therefore also its index x ) from (cid:126) m and replace x i + by y i , 1 ≤ i ≤ n −
1. Putting q = n −
1, the resulting vector is clearly indexed by Y and indescending order. That vector is the ploidy profile (cid:126) m in this case.Assume for the remainder that α >
0. Then to obtain (cid:126) m we insert α into (cid:126) m as follows.If α > m then we replace m by α and index α by y . Furthermore, we replace x i by y i , i ∈ { , . . . , n } . The resulting vector is clearly indexed by Y and, again, in descending order.That vector is the ploidy profile (cid:126) m in this case.If α ≤ m then we delete m (and therefore also x ), replace x by y , and put m = m .Let j ∈ { , . . . , n } such that m j ≥ α > m j + . Then we insert α between m j and m j + andindex it with y j + . Furthermore, for all 3 ≤ i ≤ j +
1, we replace x i with y i − , and, for all j + ≤ i ≤ q , we replace x i with y i . The resulting vector is clearly indexed by Y and indescending order. That vector is the ploidy profile (cid:126) m in this final case.Note that in either case, this process terminates when | Y | = Y ≥ m = σ ( (cid:126) m ) .To describe the construction of N ( (cid:126) m ) from σ ( (cid:126) m ) suppose that the last element in σ ( (cid:126) m ) is the simple ploidy profile (cid:126) m = ( m , m , . . . , m l ) , some l ≥
1. As outlined above, we per-form a trace back through σ ( (cid:126) m ) starting at (cid:126) m for this. Phase II: Construction of N ( (cid:126) m ) from σ ( (cid:126) m ) : If (cid:126) m = (cid:126) m then (cid:126) m is simple and we define N ( (cid:126) m ) to be D ( (cid:126) m ) .Assume for the remainder that (cid:126) m (cid:54) = (cid:126) m . Suppose that (cid:126) m (cid:48) is the ploidy profile in σ ( (cid:126) m ) such that N ( (cid:126) m (cid:48) ) has not been constructed yet and that N ( (cid:126) m ) has already been constructedfor the successor (cid:126) m = ( m , . . . , m q ) of (cid:126) m (cid:48) in σ ( (cid:126) m ) .Assume first that (cid:126) m = (cid:126) m (cid:48) . Let Y = { y , . . . , y q } , some q ∈ { n , n − } , denote the set thatindexes (cid:126) m . We distinguish the cases that α : = m − m = α (cid:54) = α = y of N ( (cid:126) m ) by the cherry { x , x } and, for all 2 ≤ i ≤ q ,relabel y i by x i + . By construction, N ( (cid:126) m ) clearly realizes (cid:126) m in this case as N ( (cid:126) m ) realizes (cid:126) m . So assume that α (cid:54) =
0. Then either α > m or α ≤ m .If α > m then m = α (and therefore is indexed by y ) and m = m (and therefore isindexed by y ). We replace y i by x i , for all 1 ≤ i ≤ q and subdivide the incoming arc of x by a new vertex u . Next, we subdivide the incoming arc of y by a vertex v and add thenew arc ( v , u ) . The resulting rooted directed acyclic graph is N ( (cid:126) m ) in this case. If α ≤ m then m = m and, therefore, is indexed by y . Let j be as in the case α ≤ m of the first phase of the construction of N ( (cid:126) m ) . Then we subdivide the incoming arc of y j + by a new vertex v and replace y by the cherry { x , x } . Next, we subdivide the incomingarc of x by a new vertex u . Finally, we add an arc ( v , u ) and delete y j + and its incomingarc ( u , y j + ) (suppressing the resulting indegree and outdegree one vertex). The resultingrooted directed acyclic graph is N ( (cid:126) m ) in this final case.Note that in either case N ( (cid:126) m ) is binary and realizes (cid:126) m because N ( (cid:126) m ) is binary andrealizes (cid:126) m . This completes the construction of N ( (cid:126) m ) in case (cid:126) m = (cid:126) m (cid:48) .If (cid:126) m (cid:48) (cid:54) = (cid:126) m then we repeat the above construction with (cid:126) m (cid:48) replaced by the predecessorof (cid:126) m (cid:48) in σ ( (cid:126) m ) and (cid:126) m replaced by (cid:126) m (cid:48) until we obtain (cid:126) m = (cid:126) m (cid:48) . Note that this process musteventually stop since at each step in the construction of N ( (cid:126) m ) at least one component in aploidy profile in σ ( (cid:126) m ) is decreased. Similar argument as in the previous case imply that N ( (cid:126) m ) is clearly also a binary phylogenetic network on X that realizes (cid:126) m in this case. 
Thiscompletes the construction of N ( (cid:126) m ) in case (cid:126) m (cid:54) = (cid:126) m (cid:48) , and thus, the construction of N ( (cid:126) m ) from σ ( (cid:126) m ) .To illustrate the construction of N ( (cid:126) m ) from a simplification sequence for a ploidy profile (cid:126) m , consider the ploidy profile (cid:126) m = ( , , , ) on X = { a , b , c , d } . Then the sequence ( (cid:126) m , ( , , , ) , ( , , , ) , ( , , ) , ( , , )) is the simplification sequence σ ( (cid:126) m ) associated to (cid:126) m . For (cid:126) m = ( , , ) the phylogenetic network on X = { x , x , x } pictured in Figure 4 is D ( (cid:126) m ) . Note within this context, x , x and x now represent certain phylogenetic trees onsubsets of X . The phylogenetic network displayed in Figure 1 is the phylogenetic network N ( (cid:126) m ) constructed from σ ( (cid:126) m ) .The construction of N ( (cid:126) m ) implies immediately the next result since, starting at the ter-minal element (cid:126) m of σ ( (cid:126) m ) , we increase the number of vertices by exactly two at each stepin the traceback through σ ( (cid:126) m ) . To state it, we denote the number of elements in the sim-plification sequence σ ( (cid:126) m ) of (cid:126) m other than (cid:126) m by s ( (cid:126) m ) , the number of vertices in N ( (cid:126) m ) by n ( (cid:126) m ) , and the number of vertices in D ( (cid:126) m ) by d ( (cid:126) m ) . Lemma 4.1.
For any ploidy profile (cid:126) m, the phylogenetic network N ( (cid:126) m ) is binary, realizes (cid:126) m, and n ( (cid:126) m ) = d ( (cid:126) m ) + s ( (cid:126) m ) . Note that as the example of the ploidy profile ( k l , k ) , l , k ≥ (cid:126) m for which the length s ( (cid:126) m ) + σ ( (cid:126) m ) for (cid:126) m is at least k l − + l . As aconsequence of this, we also have that the number of hybrid vertices in N ( (cid:126) m ) can growexponentially in l . To address this, we next study simplifications sequences for specialtypes of ploidy profiles. To this end we define the dimension of a ploidy profile (cid:126) m to be thedimension of the multiplicity vector (cid:126) m . Proposition 4.2.
Suppose (cid:126) m = ( m , . . . , m n ) is a ploidy profile on X. If there exists someq ∈ { , . . . , n } such that m q > (and m q + = provided q + ≤ n) and m q > otherwisethen following holds(i) If k ≥ such that m i = k holds for all ≤ i ≤ q then s ( (cid:126) m ) = q − .(ii) If k ≥ and l ≥ q + are integers such that m i = k ( l − i ) holds for all ≤ i ≤ qthen s ( (cid:126) m ) = ( q − ) . HE HYBRID NUMBER OF A PLOIDY PROFILE 13
Proof.
Note first that for both statements, we may assume without loss of generality that q = n .(i): Since m i = m i + holds for all 1 ≤ i ≤ n − σ ( (cid:126) m ) =
1. Hence, q − (cid:126) m into the strictly simple ploidy profile (cid:126) k = ( k ) . Since (cid:126) k is the last element in σ ( (cid:126) m ) it follows that s ( (cid:126) m ) = q − m i − m i − = ≤ i ≤ q it follows that q − (cid:126) m into a ploidy profile (cid:126) m (cid:48) = ( m (cid:48) , . . . , m (cid:48) n ) in σ ( (cid:126) m ) such that m (cid:48) i = k holds for all 1 ≤ i ≤ q . Since s ( (cid:126) m (cid:48) ) = q − σ ( (cid:126) m (cid:48) ) is the sequence induced by σ ( (cid:126) m ) starting at (cid:126) m (cid:48) it follows in view of Proposition 4.2(i) that s ( (cid:126) m ) = q − (cid:3) The next result generalizes Corollary 3.5 to general ploidy profiles.
Lemma 4.3.
For any ploidy profile (cid:126) m on X, the phylogenetic network N ( (cid:126) m ) is semi-stable.Proof. Assume for contradiction that there exists a ploidy profile (cid:126) m = ( m , . . . , m n ) on X such that N ( (cid:126) m ) is not semi-stable. Since the construction of N ( (cid:126) m ) is initialized with asimple ploidy profile (cid:126) m (cid:48) = ( m (cid:48) , . . . , m (cid:48) l ) some l ∈ { n , n − } and, by Corollary 3.5, D ( (cid:126) m (cid:48) ) issemi-stable it follows that there must exist a step i in the construction of N ( (cid:126) m ) from σ ( (cid:126) m ) such that the generated network N i is not semi-stable but all networks N j constructed in theprevious steps 2 ≤ j ≤ i − i is the first step in σ ( (cid:126) m ) , i.e. the construction of N ( (cid:126) m ) from N ( (cid:126) m (cid:48) ) . Let X (cid:48) = { x (cid:48) , . . . , x (cid:48) l } denote the leaf set of N ( (cid:126) m (cid:48) ) . For all 1 ≤ i ≤ l we may assume without loss of generalitythat x (cid:48) i indexes m (cid:48) i .We claim first that m (cid:54) = m . Indeed, if m = m then (cid:126) m (cid:48) = ( m , . . . , m n ) . But then N ( (cid:126) m ) is obtained from N ( (cid:126) m (cid:48) ) by replacing the leaf x (cid:48) of N ( (cid:126) m (cid:48) ) that indexes m by the cherry { x , x } and relabelling the remaining leaves of N ( (cid:126) m (cid:48) ) appropriately. Since, by assumption, N ( (cid:126) m ) is not semi-stable it follows that N ( (cid:126) m (cid:48) ) is not semi-stable; a contradiction. Thus, m (cid:54) = m , as claimed.We next claim that m > m cannot hold either. Assume for contradiction that m > m .Put α = m − m > α > m . 
Then m (cid:48) = α and m (cid:48) j = m j , for all 2 ≤ j ≤ l .By the construction of N ( (cid:126) m ) from N ( (cid:126) m (cid:48) ) , it follows, for all 1 ≤ i ≤ l , that x (cid:48) i is replacedby x i , and that an arc ( u , u ) is added from a subdivision vertices u of the pendant arcsending in x to a subdivision vertex u of the pendant arc ending in x . Since N ( (cid:126) m (cid:48) ) issemi-stable and this does not introduce an identifiable pair of vertices in N ( (cid:126) m ) it followsthat N ( (cid:126) m ) is also semi-stable which is impossible.So assume that α ≤ m . Then x (cid:48) = x and there exists some j ∈ { , . . . , l } such that α = m (cid:48) j and, thus, is indexed by x (cid:48) j . Then we add an arc ( u , v ) where u is a subdivisionvertex of the pendant arc ending in x (cid:48) and v is a subdivision vertex of the pendant arcending in x (cid:48) j . Next, we relabel x (cid:48) j by x and remove x (cid:48) j . Finally, for all 1 ≤ i ≤ j − replace x (cid:48) i by x i + and, for all j + ≤ i ≤ l we replace x (cid:48) j by x j . Since N ( (cid:126) m (cid:48) ) is semi-stableby assumption and we again do not introduce an identifiable pair of vertices it follows thatthis case cannot hold either. This completes the proof of the claim.Thus, m < m must hold which implies that (cid:126) m is not a ploidy profile; a contradiction.Consequently, N ( (cid:126) m ) must be semi-stable. (cid:3)
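As an informal illustration of Phase I, the following Python sketch computes the multiplicity vectors that make up a simplification sequence. It is a sketch under stated assumptions rather than part of the construction: it tracks only the component values (not the index sets $X$ and $Y$, nor the networks built in Phase II), it assumes that a ploidy profile is simple precisely when every component after the first equals 1, and the function names are hypothetical. The three cases of Phase I are folded into one step: delete $m_1$, and if $\alpha = m_1 - m_2$ is positive, insert $\alpha$ so that the vector stays in descending order.

```python
def is_simple(profile):
    """Assumed reading of 'simple': all components after the first equal 1.

    A one-component profile (strictly simple) also passes this test.
    """
    return all(m == 1 for m in profile[1:])


def simplification_sequence(profile):
    """Return the list of multiplicity vectors in the simplification sequence.

    `profile` is any iterable of positive integers; it is sorted into
    descending order first, as ploidy profiles are in the text.
    """
    current = sorted(profile, reverse=True)
    if not current or current[-1] < 1:
        raise ValueError("a ploidy profile needs positive integer components")

    sequence = [tuple(current)]
    while not is_simple(current):
        alpha = current[0] - current[1]   # alpha = m_1 - m_2
        rest = current[1:]                # delete m_1
        if alpha > 0:
            rest.append(alpha)            # insert alpha ...
        current = sorted(rest, reverse=True)  # ... keeping descending order
        sequence.append(tuple(current))
    return sequence


def s(profile):
    """Number of elements of the sequence other than the profile itself."""
    return len(simplification_sequence(profile)) - 1


if __name__ == "__main__":
    print(simplification_sequence((8, 2)))
    # [(8, 2), (6, 2), (4, 2), (2, 2), (2,)]
    print(s((3, 3, 3)))  # 2, i.e. q - 1 with q = 3 equal components
```

The loop terminates because the sum of the components strictly decreases at every step. For a profile with $q$ equal components $k \geq 2$ the sketch returns $q - 1$ steps, in line with Proposition 4.2(i), and profiles such as (8, 2) show how repeated subtraction of a small second component makes the sequence long.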
5. THE HYBRID NUMBER OF A PLOIDY PROFILE

To prove the main result of this section (Theorem 5.1), which is also the main result of this paper, we denote for a ploidy profile $\vec{m} = (m_1, \ldots, m_l)$, some $l \geq 1$, the sum $\sum_{i=1}^{l} m_i$ by $\mu(\vec{m})$.

Theorem 5.1.
Suppose (cid:126) m is a ploidy profile on X. Then N ( (cid:126) m ) is an attainment of (cid:126) m.Proof. Suppose (cid:126) m = ( m , . . . , m n ) is a ploidy profile on X = { x , . . . , x n } , 1 ≤ n . Then wemay assume without loss of generality that n ≥ (cid:126) m is simple and, therefore,the theorem follows in view of Theorem 3.4. To see the theorem, we perform induction on µ = µ ( (cid:126) m ) .The base case is µ =
2. If µ = n = m = = m or n = m = X = { x } . If n = m = = m then N ( (cid:126) m ) is a phylogenetic tree on X = { x , x } . Hence h ( N ( (cid:126) m )) = n = m = (cid:126) m is a strictly simple ploidy profile and the theorem follows in view ofTheorem 3.4.So assume for the remainder that µ ≥ (cid:126) m (cid:48) on some non-empty finite set Y for which µ ( (cid:126) m (cid:48) ) < µ holds. Note that we may alsoassume that (cid:126) m is not a simple ploidy profile as otherwise the theorem follows again byTheorem 3.4.Assume for contradiction that there exists a binary phylogenetic network Q on X thatrealizes (cid:126) m and h ( Q ) < h ( N ( (cid:126) m )) . Then there must exist a ploidy profile (cid:126) m (cid:48) = ( m (cid:48) , . . . , m (cid:48) l ) ,some l ≥
1, on some set Y (cid:48) in the simplification sequence σ ( (cid:126) m ) of (cid:126) m such that there existsa binary phylogenetic network Q (cid:48) on Y (cid:48) that realizes (cid:126) m (cid:48) and for which h ( Q (cid:48) ) < h ( N ( (cid:126) m (cid:48) )) holds. Without loss of generality we may assume that (cid:126) m (cid:48) is such that for all multiplicityvectors (cid:126) m (cid:48)(cid:48) succeeding (cid:126) m (cid:48) in σ ( (cid:126) m ) we have that h ( N ( (cid:126) m (cid:48)(cid:48) )) ≤ h ( Q (cid:48)(cid:48) ) for all binary phyloge-netic networks Q (cid:48)(cid:48) that realize (cid:126) m (cid:48)(cid:48) . Furthermore, we may assume without loss of generalitythat (cid:126) m (cid:48) = (cid:126) m .Let (cid:126) m = ( m , . . . , m q ) , some q ≥
1, denote the successor of (cid:126) m in σ ( (cid:126) m ) . Note that in case α = m − m = q = n − m i = m i + holds for all 1 ≤ i ≤ n −
1. If α > m then q = n and we have that m = α , and that m i = m i , for all 2 ≤ i ≤ n . In case α ≤ m then wealso have that q = n and that m = m . Furthermore, there exists some j ∈ { , . . . , n } suchthat m j = α , that m i = m i , for all 2 ≤ i ≤ j −
1, and that m i + = m i , for all j ≤ i ≤ n . Let Y = { y , . . . , y q } denote the set that indexes (cid:126) m . Without loss of generality, we may assume HE HYBRID NUMBER OF A PLOIDY PROFILE 15 that m N ( (cid:126) m (cid:48) ) ( y i ) = m i , for all 1 ≤ i ≤ q . Put N = N ( (cid:126) m ) and N = N ( (cid:126) m ) . We distinguishbetween the cases that (a) there exists some 2 ≤ l ≤ n such that m = m l and, provided l < n , that m l + (cid:54) = m and (b) that m > m . Case (a):
Note first that the minimality of h ( Q ) combined with the assumption that m = m l , some 2 ≤ l ≤ n and, provided that l < n exists, also m (cid:54) = m l + holds, implies thatthe induced subgraph T of Q connecting the elements in X (cid:48) = { x , . . . , x l } must be aphylogenetic tree on X (cid:48) and m = m j , for all 2 ≤ j ≤ l . Since (cid:126) m = ( m , . . . , m n ) andthe phylogenetic network Q obtained from Q by deleting x and its incoming arc (sup-pressing resulting vertices of indegree and outdegree one) and renaming x i + by y i , forall 1 ≤ i ≤ q clearly realizes (cid:126) m . Since µ ( (cid:126) m ) < µ it follows by assumption on (cid:126) m that h ( N ) ≤ h ( Q ) . Since N is obtained from N by subdividing the incoming arc of a certainleaf in L ( T ) − { x } by a new vertex u , adding in the new arc ( u , x ) , and reversing theaforementioned relabelling of x i + in terms of y i , it follows in view of the fact that T is atree that h ( Q ) < h ( N ) = h ( N ) ≤ h ( Q ) = h ( Q ) ; a contradiction. Consequently, h ( N ) mustattain (cid:126) m in this case. Case (b):
We claim that we can construct a phylogenetic network Q from Q that realizes (cid:126) m such that h ( Q ) = h ( Q ) +
1. To see this, note first that by Lemma 3.2 there exists adirected path P from the root ρ Q of Q to x such that every hybrid vertex of Q is containedin P .For the remainder of the proof of the claim, assume that h ∈ H ( Q ) is above x and x such that there exists no hybrid vertex in H ( Q ) that is strictly below h and also above x and x . Note that, by the minimality of h ( Q ) , a leaf x ∈ X − { x } of Q is below h if and onlyif the multiplicity of x is m . Thus we may assume without loss of generality that Q is suchthat there exists an arc a of Q such that one of the connected components of Q obtainedby deleting a is a subtree T whose leaf set comprises all leaves of Q of multiplicity m .Without loss of generality we may assume that x is the sole leaf of T .To obtain Q we next transform Q as follows (see Figure 5 for an illustration). Let u denote the unique child of h which, by the choice of h , must necessarily be a tree vertexof Q . Put α = m − m and assume that u , w ∈ V ( Q ) are such that a = ( u , w ) . Then wesubdivide the arc ( h , u ) by a new vertex u (cid:48) , add the arc ( u (cid:48) , w ) and delete a . Next, we deletethe arc ( u (cid:48) , u ) and add a new vertex ρ , an arc from ρ to the root of Q , and also the arc ( ρ , u ) . Last-but-not-least, we suppress all resulting vertices of indegree and outdegree one.In case α ≤ m we have m = m , we complete the construction of Q by indexing α by y j and m by y and, in Q , relabel appropriately the remaining x i by the elements in Y − { y j , y } . It is easy to check that Q is a phylogenetic network on Y , that realizes (cid:126) m .In case α > m we index m by y and α by y and, in Q , we relabel x by y and x by y . Relabelling the remaining leaves of Q appropriately in terms of Y it follows that Q is again a phylogenetic network on Y that realizes (cid:126) m . Since in either case µ ( (cid:126) m ) < µ ( (cid:126) m ) holds, it follows by assumption on (cid:126) m that h ( N ) ≤ h ( Q ) . F IGURE
5. A schematically depiction of the key step in the transfor-mation of the phylogenetic network Q into the phylogenetic network Q (cid:48) considered in the proof of Theorem 5.1. The non-dashed arcs togetherwith T make up Q and Q ∼ indicates the part of Q that is of no interest tothe discussion. Removing the two thin arcs and adding the three dottedarcs results in Q (cid:48) .Since in either case h ( Q ) = h ( Q ) + h ( Q ) < h ( N ) = h ( N ) + ≤ h ( Q ) + = h ( Q ) , which is impossible. This concludes the proof of Case (b) and therebythe proof of the theorem. (cid:3)
6. APPROXIMATING $D(\vec{m})$

The definition of $D(\vec{m})$ for a simple ploidy profile $\vec{m} = (m_1, \ldots, m_n)$ heavily relies on the assumption that we can find a prime factor decomposition of $m_1$, which may be computationally infeasible given that ploidy levels can become very large (e.g. the silk glands of the commercially important silkworm Bombyx mori are known to have a ploidy level of up to 1,048,576 [4]). To address this problem we focus in this section on an approximation $B(\vec{m})$ of $D(\vec{m})$ which exploits the fact that any positive integer can be represented as a sum of powers of two. The underlying rationale for the definition of $B(\vec{m})$ is the observation that every polyploidization event doubles the number of chromosomes in a genome and, therefore, may be viewed as a power of two.

We remark in passing that, provided we can find a prime factor decomposition $p_1^{\alpha_1} \cdots p_q^{\alpha_q}$ of $m_1$, we could also approximate each network $N_{p_i}$ as $B(\vec{p_i})$ where $\vec{p_i}$ is the strictly simple ploidy profile $(p_i)$.

We start with some further definitions. Suppose that $m$ is a positive integer. Then we refer to the vector $(i_1, \ldots, i_q)$, $q \geq 1$, with $i_j \neq 0$ for all $1 \leq j \leq q-1$, such that $m = \sum_{j=1}^{q} 2^{i_j}$ holds as the binary representation vector of $m$. For example, the binary representation vector for $m = 11$ is $(3, 1, 0)$. Following [26], we call an induced subgraph $N'$ of a phylogenetic network $N$ with $|V(N')| = 2 = |A(N')|$ a bead of $N$. Binary phylogenetic networks $N$ on $X$ such that $N$ is either a phylogenetic tree on $X$ or every hybrid vertex is contained in a bead are called beaded trees (see e.g. [26] and [7] for more on such graphs). We denote a beaded tree with a unique leaf and $q \geq 1$ beads by $B(q)$.

We start with presenting the construction of $B(\vec{m})$ for a strictly simple ploidy profile $\vec{m} = (m_1)$ on $X = \{x_1\}$. If $m_1 = 1$ then $B(\vec{m})$ is the phylogenetic tree whose unique vertex is $x_1$. So assume that $m_1 \geq$
2. Let ( i , . . . , i q ) denote the binary representation vector of m .Then B ( (cid:126) m ) is the phylogenetic network obtained from the beaded network B ( i ) on { x } by attaching q − a j , 2 ≤ j ≤ q as follows. For all 2 ≤ j ≤ q , the tail of a j is a subdivision vertex of the bead B of B ( i ) that contains the root ρ of B ( i ) (we ensurethat if q ≥ ρ contains at least one subdivision vertex althoughother options are conceivable). The head of a j is a subdivision vertex of the outgoing arc ofthe hybridization vertex of B ( i ) that has precisely i j hybridization vertices of B ( i ) strictlybelow it. This completes the construction of B ( (cid:126) m ) in case (cid:126) m is strictly simple.Assume for the remainder that (cid:126) m = ( m , . . . , m n ) is a simple (but not strictly simple)ploidy profile on X = { x , . . . , x n } . To obtain B ( (cid:126) m ) we first construct B ( (cid:126) m (cid:48) ) for the strictlysimple ploidy profile (cid:126) m (cid:48) = ( m ) on { x } . To B ( (cid:126) m (cid:48) ) we add, as in the case of D ( (cid:126) m ) , n − s i , 2 ≤ i ≤ n , to the outgoing arcs of ρ in B ( i ) (ensuring again that if q > ρ contains at least one subdivision vertex). Finally, to each s i we attach the arc ( s i , x i ) , 2 ≤ i ≤ n . This completes the construction of B ( (cid:126) m ) in this case.Clearly, B ( (cid:126) m ) is a binary phylogenetic network on X that realizes (cid:126) m .To illustrate this construction, consider again the simple ploidy profile (cid:126) m = ( , , ) on X = { x , x , x } . Then B ( (cid:126) m ) and the phylogenetic network D ( (cid:126) m ) depicted in Figure 4are equivalent. As we shall see below, however, there exist simple ploidy profiles for whichthis is not the case.As an immediate consequence of the construction of B ( (cid:126) m ) we have the following resultas there exists no step in the construction of B ( (cid:126) m ) in which an identifiable pair of verticesis introduced. Lemma 6.1.
For any simple ploidy profile (cid:126) m the phylogenetic network B ( (cid:126) m ) is binary,realizes (cid:126) m, and is semi-stable. To gain insight into the structure of B ( (cid:126) m ) we next present formulae for counting, fora simple ploidy profile (cid:126) m , the number b ( (cid:126) m ) of vertices and also the number of hybridvertices of B ( (cid:126) m ) . Note that such formulae are known for certain types of phylogeneticnetworks without beads (see e.g.[18, 27] and [25] for more). To this end, we requirefurther terminology.Suppose m ≥ (cid:126) v m = ( v fm , . . . , v m , v m ) , some f = f ( m ) ≥ (1,0)-representation of m if, for all 0 ≤ i ≤ f , we have that v im is containedin { , } and indexed by 2 i and m = ∑ fi = i v im . For ease of presentation and unless statedotherwise, we always assume that, when reading from left to right, v fm is the first non-zerovalue of (cid:126) v m . For (cid:126) v m = ( v fm , . . . , v m ) , some f ≥
1, the (1,0)-representation of m , we put p ( m ) = |{ v im : v im = , for all 0 ≤ i ≤ f − }| . Furthermore, we denote for a simple ploidyprofile (cid:126) m = ( m , m , . . . , m n ) , the number of component values m i , 2 ≤ i ≤ n , such that m i = o ( (cid:126) m ) . To illustrate these definitions, consider the ploidy profile (cid:126) m = ( , , , ) . Then m = (cid:126) v m = ( , , ) . Hence, f ( m ) = p ( m ) =
1, and o ( (cid:126) m ) = Theorem 6.2.
Suppose that (cid:126) m = ( m , m , . . . , m n ) is a simple ploidy profile on X and thatm ≥ . Then b ( (cid:126) m ) = ( f ( m ) + p ( m ) + o ( (cid:126) m )) + . Furthermore, B ( (cid:126) m ) has f ( m ) + p ( m ) hybrid vertices.Proof. To see that b ( (cid:126) m ) = ( f ( m ) + p ( m ) + o ( (cid:126) m )) +
1, we distinguish the cases that (cid:126) m is strictly simple and that (cid:126) m is not strictly simple.Let X = { x , . . . , x n } and assume first that (cid:126) m is strictly simple. Then (cid:126) m = ( m ) and X = { x } . Hence, o ( (cid:126) m ) =
0. Thus, we need to show that b ( (cid:126) m ) = ( f ( m ) + p ( m )) + m ≥
2. Put m = m . If m = (cid:126) v m = ( , ) . Hence, f ( m ) = p ( m ) =
0. Since the root of B ( (cid:126) m ) is contained in the single bead of B ( (cid:126) m ) and | X | = b ( (cid:126) m ) =
3. Thus, the stated formula for b ( (cid:126) m ) holds.So assume that m ≥ (cid:126) m (cid:48) = ( m (cid:48) ) on X for which m (cid:48) ≤ m . Put (cid:126) m (cid:48) = ( m ) and assume that (cid:126) m is the strictly simpleploidy profile ( m + ) . Also, assume that f , f (cid:48) ≥ (cid:126) v m = ( v fm , . . . , v m , v m ) and that (cid:126) v m + = ( v f (cid:48) m + , . . . , v m + ) . Note that f = f ( m ) and f (cid:48) = f ( m + ) . Also notethat in case f (cid:54) = f (cid:48) we view (cid:126) m as a f (cid:48) -dimensional vector by extending (cid:126) m in the frontby the component value 0. Binary addition of the ( , ) -representations of m and of 1,respectively, where we extend the ( , ) -representation of 1 in a similar way implies that f (cid:48) ∈ { f , f + } . Furthermore, it implies that if f (cid:54) = f (cid:48) , then there must exist some 0 ≤ l ≤ f − v l + m + = v l + m and v im + (cid:54) = v im , for all 0 ≤ i ≤ l . Note that v f (cid:48) m + = = v fm and f (cid:48) = f + f (cid:54) = f (cid:48) . If f = f (cid:48) then l = f and v f (cid:48) m + = v fm . Furthermore, v lm = v lm + = B ( (cid:126) m ) can be obtained from B ( (cid:126) m (cid:48) ) note first that, for every simpleploidy profile (cid:126) p = ( p , . . . , p n ) and binary representation vector ( i , . . . , i q ) , q ≥
1, of p every hybrid vertex of B ( (cid:126) p ) naturally corresponds via a bijection χ (cid:126) p to a power of twowhere the hybrid vertex furthest away from the root of B ( i ) corresponds to 2 i . In view ofthis, we may assume without loss of generality that each hybrid vertex of B ( (cid:126) m (cid:48) ) is labelledby the power of two it corresponds to via χ (cid:126) m (cid:48) . Exploiting this observation, it follows that B ( (cid:126) m ) is obtained from B ( (cid:126) m (cid:48) ) via the following process where ρ denotes the root of B ( (cid:126) m (cid:48) ) .If l =
1, then we subdivide one of the outgoing arcs of ρ by a new vertex u and also theincoming arc of x by a vertex v and add the arc ( u , v ) .So assume l ≥
2. Then there must exist a component in the (0,1)-representation of (cid:126) v m whose value is one and, in (cid:126) v m + , the corresponding component value is zero. Forall 0 ≤ i ≤ l for which v im = v im + = B ( (cid:126) m (cid:48) ) whose headcorresponds via χ (cid:126) m (cid:48) to v im and suppress the resulting vertices of indegree and outdegree one.Let r denote the number of removed arcs. If f = f (cid:48) we next subdivide one of the outgoingarcs of ρ by a vertex u and the outgoing arc of h l by a vertex u and add the arc ( u , v ) . If f (cid:54) = f (cid:48) then we subdivide the incoming arc of x by two new vertices u and v . Assumingthat v is closer to the unique leaf x of B ( (cid:126) m (cid:48) ) than v , we then also add the arc ( u , v ) . Ineither case, it follows by induction that b ( (cid:126) m ) = b ( (cid:126) m (cid:48) ) − r + = ( f ( m ) + p ( m ) − r ) + . HE HYBRID NUMBER OF A PLOIDY PROFILE 19 If f = f (cid:48) (and so v fm + = v f (cid:48) m ), we have p ( m + ) = p ( m ) − r + f ( m + ) = f ( m ) .It follows that b ( (cid:126) m ) = ( f ( m + ) + p ( m + )) +
1, as required. So assume that f (cid:54) = f (cid:48) .Then f ( m + ) = f ( m ) + p ( m ) = r +
1. Furthermore, p ( m + ) = = p ( m ) − r − b ( (cid:126) m ) = ( f ( m ) + + p ( m ) − r − ) + = ( f ( m + ) + p ( m + )) + b ( (cid:126) m ) holds for strictly simpleploidy profiles.So assume that (cid:126) m is not strictly simple. Then since (cid:126) m is simple we have m =
1. Theformula now follows from the discussion of the case of strictly simple ploidy profiles com-bined with the observation that every m i (cid:54) =
1, 2 ≤ i ≤ n , accounts for two further verticesin B ( (cid:126) m ) when constructing B ( (cid:126) m ) from B ( (cid:126) m ) where (cid:126) m is the strictly simple ploidy profile ( m ) .The remainder of the theorem is straight-forward to see since, apart from the leaves,every tree vertex of B ( (cid:126) m ) can be paired up with a unique hybrid vertex of B ( (cid:126) m ) therebygiving rise to a 1-1 correspondence between the set of tree-vertices and hybrid vertices of B ( (cid:126) m ) . (cid:3) Note that Theorem 6.2 immediately implies that h ( B ( (cid:126) m )) is an upper bound on thehybrid number of a simple ploidy profile (cid:126) m = ( m , . . . , m n ) . This bound is clearly sharp incase m is a power of two or of the form 2 k +
1, some positive integer $k$. As the plot in Figure 6 indicates, there are other cases where it is also sharp (although it is not sharp in general).

(Plot axes: m, horizontal; Hybrid Number, vertical.)

FIGURE
6. A comparison of h ( B ( (cid:126) m )) and h ( D ( (cid:126) m )) for the simpleploidy profile (cid:126) m = ( m ) where 1 ≤ m ≤ h ( D ( (cid:126) m )) < h ( B ( (cid:126) m )) are indicated as dots at the bottom. They arethe 16 primes given in the text along with the numbers 94, 166, 172, 188,215, 235, 249, 301, 326, 332, 344, 376, 387, 415, 423, 428, 430, 470,473, 489, 498.In fact there are 16 primes within the 93 primes in the interval considered for m in thatfigure for which h ( D ( (cid:126) m )) < h ( B ( (cid:126) m )) . They are 43, 47, 83, 107, 163, 263, 283, 317, 347, h ( D ( m )) = h ( B ( m )) might hold for some of the multiples of theseprimes.To indicate in the construction of N ( (cid:126) m ) that we have initialized it by B ( (cid:126) m ) (as opposedto D ( (cid:126) m ) ) we denote the resulting phylogenetic network by N B ( (cid:126) m ) . As an immediateconsequence of Theorem 6.2 and Lemma 4.1 we obtain Corollary 6.3.
For any ploidy profile (cid:126) m = ( m , . . . , m n ) the phylogenetic network N B ( (cid:126) m ) has b ( (cid:126) m ) + s ( (cid:126) m ) = ( f ( m ) + p ( m ) + s ( (cid:126) m ) + o ( (cid:126) m )) + vertices and at most f ( m ) + p ( m ) + s ( (cid:126) m ) hybrid vertices. As a further consequence of Theorem 6.2 we obtain the following result since, for sim-ple ploidy profiles, N B ( (cid:126) m ) is B ( (cid:126) m ) . Proposition 6.4.
Suppose (cid:126) m = ( m , . . . , m n ) is a ploidy profile on X. For all ≤ k ≤ n, let ( i k , . . . , i kl k ) denote the binary representation vector of m k , some l k ≥ . Then the followingholds.(i) h ( (cid:126) m ) ≤ ∑ nk = ( i k + l k − ) and this bound is sharp in case (cid:126) m is simple.In that case, h ( (cid:126) m ) = i + l − and B ( (cid:126) m ) attains that bound.(ii) If m i = i k holds for all ≤ k ≤ n then h ( (cid:126) m ) = ∑ nk = i k .Proof. (i) To see the stated inequality, we construct a binary phylogenetic network B on X = { x , . . . , x n } from (cid:126) m as follows. First we construct for all 1 ≤ k ≤ n the network B ( (cid:126) m k ) for the strictly simple ploidy profile (cid:126) m k = ( m k ) . Next, we add a new vertex ρ and, for all1 ≤ k ≤ n . an arc from ρ to the root of B ( (cid:126) m k ) . If the resulting phylogenetic network on X is binary then that network is B . Otherwise, B is the phylogenetic network obtained byresolving ρ .Clearly B realizes (cid:126) m and, implied by Theorem 6.2, we have h ( B ( (cid:126) m k )) = f ( m k )+ p ( m k ) = i k + l k −
1. Thus, h ( (cid:126) m ) ≤ h ( B ) = ∑ nk = ( i k + l k − ) , as required. If (cid:126) m is simple k = h ( B ) = h ( B ( (cid:126) m )) = i + l − B ( (cid:126) m k ) isthe network B ( i k ) . (cid:3)
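The quantities used in this section can be computed directly from the binary expansion of an integer, without any prime factorization. The following Python sketch is an illustration only. It assumes that the vertex count in Theorem 6.2 reads $b(\vec{m}) = 2(f(m_1) + p(m_1) + o(\vec{m})) + 1$, which is consistent with the value $b(\vec{m}) = 3$ obtained for $m_1 = 2$ in the proof, and that the bound in Proposition 6.4(i) sums $f(m_k) + p(m_k)$ over all components. All function names are hypothetical.

```python
def binary_representation_vector(m):
    """Exponents i_1 > ... > i_q with m = 2**i_1 + ... + 2**i_q."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return [i for i in range(m.bit_length() - 1, -1, -1) if (m >> i) & 1]


def f(m):
    """Index of the leading 1 in the (1,0)-representation of m."""
    return m.bit_length() - 1


def p(m):
    """Number of 1s in the (1,0)-representation of m below the leading 1."""
    return bin(m).count("1") - 1


def beaded_hybrid_count(m1):
    """Hybrid vertices of B((m1)): f + p, which equals i_1 + q - 1."""
    return f(m1) + p(m1)


def beaded_vertex_count(profile):
    """Assumed form of Theorem 6.2 for a simple profile: 2(f + p + o) + 1."""
    m1 = profile[0]
    o = sum(1 for m in profile[1:] if m == 1)  # components m_i = 1, i >= 2
    return 2 * (f(m1) + p(m1) + o) + 1


def hybrid_number_upper_bound(profile):
    """Bound of Proposition 6.4(i): sum of f(m_k) + p(m_k) over all k."""
    return sum(f(m) + p(m) for m in profile)


if __name__ == "__main__":
    print(binary_representation_vector(11))  # [3, 1, 0]
    print(beaded_hybrid_count(11))           # 5
    print(beaded_vertex_count((2,)))         # 3
```

For $m = 11$ the vector is $(3, 1, 0)$, so $f(11) = 3$ and $p(11) = 2$, and the count $f + p = 5$ agrees with $i_1 + q - 1 = 3 + 3 - 1$. A component equal to 1 contributes 0 to the bound, since its binary representation vector is $(0)$.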
7. DISCUSSION

Motivated by the signal left behind by polyploidization, we have introduced and studied the problem of finding the hybrid number $h(\vec{m})$ of a ploidy profile $\vec{m}$ (although our arguments apply to any type of dataset that induces a multiplicity vector). Using the framework of a phylogenetic network, we provide a construction of a binary phylogenetic network $N(\vec{m})$ that is guaranteed to attain that number. This allows us to derive an exact formula for computing $h(\vec{m})$ and also the size of the vertex set of $N(\vec{m})$ in terms of a simplification sequence $\sigma(\vec{m})$ that we associate to $\vec{m}$. Since, as we show, there exists an infinite family of ploidy profiles for which the length of this sequence grows exponentially, we also provide a bound for $h(\vec{m})$ and show that this bound is sharp for certain types of ploidy profiles.
Despite these encouraging results, numerous questions that might merit further research remain. These include exploring the relationship between so-called accumulation phylogenies introduced in [2] and ploidy profiles. Given the centrality of the simplification sequence $\sigma(\vec{m})$ of $\vec{m}$ for computing $h(\vec{m})$, it might also be of interest to see if more light can be shed on the length of $\sigma(\vec{m})$. Finally, it might also be of interest to better understand the relationship between ancestral profiles introduced in [21] and ploidy profiles.

REFERENCES
[1] C. Ané, B. Larget, D.A. Baum, and A. Rokas. Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution, 24:412–426, 2007.
[2] M. Baroni and M. Steel. Accumulation phylogenies. Annals of Combinatorics, 10:19–30, 2006.
[3] M. Bordewich and C. Semple. Computing the minimum number of hybridization events for a consistent evolutionary history. Discrete Applied Mathematics, 155(8):914–928, 2007.
[4] F. D'Amato and M. Durante. Polyploidy. Encyclopedia of the Life Sciences, 2002.
[5] G. Cardona, M. Llabrés, F. Rosselló, and G. Valiente. A distance metric for a class of tree-sibling phylogenetic networks. Bioinformatics, 24:1481–1488, 2008.
[6] D. Gusfield. ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, 2014.
[7] K.T. Huber, S. Linz, and V. Moulton. Weakly displaying trees in temporal tree-child network. arXiv.
[8] K.T. Huber and V. Moulton. Phylogenetic networks from multi-labelled trees. Journal of Mathematical Biology, 52:613–632, 2006.
[9] K.T. Huber and V. Moulton. Encoding and constructing 1-nested phylogenetic networks with trinets. Algorithmica, 66:714–738, 2013.
[10] K.T. Huber, V. Moulton, A. Spillner, S. Storandt, and R. Suchecki. Computing a consensus of multilabeled trees. Proceedings of the Workshop on Algorithm Engineering and Experiments, pages 84–92, 2012.
[11] K.T. Huber, V. Moulton, M. Steel, and T. Wu. Folding and unfolding phylogenetic trees and networks. Journal of Mathematical Biology, 73(6-7):1761–1780, 2016.
[12] K.T. Huber, B. Oxelman, M. Lott, and V. Moulton. Reconstructing the evolutionary history of polyploids from multilabeled trees. Molecular Biology and Evolution, 23:1784–1791, 2006.
[13] K.T. Huber and G. E. Scholz. Phylogenetic networks that are their own fold-ups. Advances in Applied Mathematics, 113:101959, 2020.
[14] D. Huson, R. Rupp, and C. Scornavacca. Phylogenetic Networks. Cambridge University Press, 2010.
[15] G. Jones, S. Sagitov, and B. Oxelman. Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology, 62:467–478, 2013.
[16] T. Marcussen, L. Heier, A. K. Brysting, B. Oxelman, and K. S. Jakobsen. From gene trees to a dated allopolyploid network: Insights from the Angiosperm genus Viola (Violaceae). Systematic Biology, 64:84–101, 2015.
[17] T. Marcussen, K. S. Jakobsen, J. Danihelka, H. E. Ballard, K. Blaxland, A.K. Brysting, and B. Oxelman. Inferring species networks from gene trees in high-polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic Biology, 61:107–126, 2012.
[18] C. McDiarmid, C. Semple, and D. Welsh. Counting phylogenetic networks. Annals of Combinatorics, 19:205–224, 2015.
[19] C. Oberprieler, F. Wagner, S. Tomasello, and K. Konowalik. A permutation approach for inferring species networks from gene trees in polyploid complexes by minimizing deep coalescences. Methods in Ecology and Evolution, 8:835–849, 2017.
[20] M. Ownbey. Natural hybridization and amphiploidy in the genus Tragopogon. American Journal of Botany, 37:487–499, 1950.
[21] P. L. Erdős, C. Semple, and M. Steel. A class of phylogenetic networks reconstructable from ancestral profiles. Mathematical Biosciences, 313:33–40, 2019.
[22] P. D. Blischak, C. E. P. Thompson, E. M. Waight, L. Kubatko, and A. Wolfe. Inferring patterns of hybridization and polyploidy in the plant genus Penstemon (Plantaginaceae). bioRxiv.
[23] M. Popp, P. Erixon, F. Eggens, and B. Oxelman. Origin and evolution of a circumpolar polyploid species complex in Silene (Caryophyllaceae) inferred from low copy nuclear RNA polymerase introns, rDNA, and chloroplast DNA. Systematic Biology, 30:302–313, 2005.
[24] C. Solís-Lemus and C. Ané. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics, 12:e1005896, 2016.
[25] M. Steel. Phylogeny: Discrete and Random Processes in Evolution. Society for Industrial and Applied Mathematics, 2016.
[26] L. van Iersel, R. Janssen, M. Jones, Y. Murakami, and N. Zeh. Polynomial-time algorithms for phylogenetic inference problems involving duplication and reticulation. 2019.
[27] L. van Iersel and S. Kelk. Counting the simplest phylogenetic networks from triplets. Algorithmica, 60:207–235, 2011.
[28] F. Varoquaux, R. Blanvillain, M. Delseny, and P. Gallois. Less is better: new approaches for seedless fruit production. Trends in Biotechnology, 18:233–242, 2000.
[29] D. Wen, Y. Yu, J. Zhu, and L. Nakhleh. Inferring phylogenetic networks using PhyloNet. Systematic Biology, 67:735–740, 2018.

SCHOOL OF COMPUTING SCIENCES, UNIVERSITY OF EAST ANGLIA, NORWICH, UK
Email address: [email protected]

SCHOOL OF COMPUTING SCIENCES, UNIVERSITY OF EAST ANGLIA, NORWICH, UK
Email address: