Parameter identifiability for a profile mixture model of protein evolution
SAMANEH YOURDKHANI, ELIZABETH S. ALLMAN, AND JOHN A. RHODES
Abstract.
A Profile Mixture Model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend in part on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here we show that a Profile Mixture Model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.

1. Introduction
A Profile Mixture model is a certain stochastic model of protein sequence evolution that describes the changes in sequences along the tree of evolutionary relationships of a collection of taxa. Such a model is often used for the inference of the tree from sequence data, using standard maximum likelihood or Bayesian statistical frameworks. Here we investigate the question of parameter identifiability for this model: Are the model parameters, both the tree topology and numerical ones, determined by a site pattern distribution arising from the model? Parameter identifiability, which informally means that valid parameter inference is possible in ideal circumstances, is an essential component of the theoretical justification for standard statistical inference approaches.

In models of protein sequence generation, amino acid site patterns are generally assumed to be independent and identically distributed across the sites. Common continuous-time models of amino acid substitutions are instances of the general time-reversible model (GTR), which assumes a single rate matrix Q constant over a metric tree, or extensions that allow for additional scalar rate variation at individual sites. The rate matrix Q has off-diagonal entries agreeing with those of R diag(π), where R is a symmetric matrix of exchangeabilities and π is a vector of frequencies of the amino acids which remains stable under the model.

In principle, one can infer R, π, and a metric tree of taxon relationships from protein sequence data using standard statistical frameworks. However, with 20 amino acids the state space for the model is large, so an exchangeability matrix R is often fixed in advance, having been previously determined empirically for particular types of data. Well-known exchangeabilities for protein alignments include the JTT [Jones et al., 1992], WAG [Whelan and Goldman, 2001], and LG [Le et al., 2008] matrices.

Date: June 30, 2020.
When inspecting protein sequence data, however, it is often clear that the GTR assumption of identically distributed sites is a poor one, since sites have visibly different amino acid compositions. Site residue distributions, or profiles, likely differ because of biophysical properties of amino acids (e.g., hydrophilicity, polarity, or charge), and the associated structural and functional constraints on the protein. This phenomenon suggests a model with multiple classes of substitution processes, and in particular a mixture model using a variety of profiles with the same exchangeabilities for all classes. Mixture models can provide better fit to data as they introduce more parameters, though they also increase computational time and may lead to overfitting of the data.

But a more fundamental issue with adopting a mixture model is that one may lose parameter identifiability. If several choices, or even more worrisome, infinitely many choices of parameters lead to the same probability distribution under the model, then even with an idealized infinite data set perfectly in accord with the model one could not recover the parameter values under which the data arose. Since the goal of most phylogenetic analyses is to infer model parameters, generally the topological tree but often numerical parameters as well, identifiability is an essential property for a model to be useful. Non-identifiability poses particular challenges in Bayesian MCMC analyses, where it may be manifested as a lack of convergence [Rannala, 2002].

For non-mixture site substitution models in phylogenetics parameter identifiability has long been established, but mixture models provide greater challenges. Although computational work may suggest whether it holds or fails, parameter identifiability can only be established theoretically, as it is a model property, not dependent on an inference method.
In recent years algebraic methods have been introduced and successfully applied to a number of phylogenetic mixture models; see, for example, Allman and Rhodes [2006, 2008, 2009], Allman et al. [2010, 2011, 2019], Chifman and Kubatko [2015], Long and Sullivant [2015], Hollering and Sullivant [2019], Wascher and Kubatko [2020]. While one of these works [Rhodes and Sullivant, 2012] established a rather general result on parameter identifiability of phylogenetic mixture models with many components, it unfortunately does not apply to the profile mixture model's specific structure.

In this work, we prove parameter identifiability for a Profile Mixture Model (PM) of amino acid site substitution. PM models were introduced in the Bayesian context [Lartillot and Philippe, 2004, Lartillot et al., 2009, 2013], where the number of profiles might be inferred using a Dirichlet process prior, and as finite mixtures with a fixed number of components in a Maximum Likelihood analysis [Le et al., 2008]. Studies suggest that PM models perform better than single-class models, particularly on data that is saturated or with an underlying long branch attraction bias [Lartillot et al., 2007, Wang et al., 2008]. Mixtures with as many as 60 classes have been investigated with empirical data sets, with indications that around 20 profiles often provides good fit [Le et al., 2008]. For a recent study assessing the performance under simulation of mixture models including discrete-Γ rates-across-sites and PM models, see Wang et al. [2014].

Our main result, Theorem 5.7, establishes that parameters of a profile mixture model with up to 73 classes on a tree of 9 or more taxa are generically identifiable; that is, identifiable outside an exceptional parameter set of measure zero. For any fixed number
of classes, the parameters include the tree topology, the tree's edge lengths, the exchangeabilities, the profiles, and the weights of the mixture components.

The proof techniques we employ are algebraic in nature, using ideas from tensor decomposition and algebraic geometry. These tools, which have been introduced and used previously for phylogenetic models [Allman and Rhodes, 2006, 2009, Rhodes and Sullivant, 2012], are based on the algebraic properties of matrices and 3-way tensors obtained from rearranging the entries of the distribution of site pattern frequencies. However, the structure of the PM model, with profiles varying over classes while the exchangeabilities do not, introduces important differences that prevent any easy deduction of the result from previous work. At several points in our arguments we use exact integer computation, performed by the software
Pari/GP [The PARI Group, 2019], to establish certain generic conditions we need on ranks of matrices.

As motivated by applications to amino acid models, our main theorem is stated for the profile mixture model with a state space of size 20. However, the techniques used for establishing it apply to arbitrary sizes κ of the state space. For example, κ might be 4 for DNA, or 61 for codons. However, appropriate rank computations would need to be carried out to complete the proof in such contexts. In the κ = 20 setting we also believe the proof techniques could be pushed to establish identifiability for more than 73 profiles, at the expense of requiring more taxa on the tree.

This paper is organized as follows: In Section 2 we introduce phylogenetic substitution models, and in particular the profile mixture model under study. Section 3 provides algebraic definitions and lemmas, though removed from the biological setting of interest. Section 4 then connects the phylogenetic profile mixture model with these algebraic notions. We conclude in Section 5 with the proof of our main theorem on identifiability of the PM model parameters.

2. Markov Models on Trees
We begin by introducing Markov models of site substitution along a tree. Throughout, let κ be the size of the state space, which we identify with [κ] = {1, 2, . . . , κ}. For protein data, κ = 20. Let T_ρ be a rooted topological tree, with root ρ and leaves labelled by elements of the taxon set X. The general Markov model of κ-state sequence evolution along T_ρ is parameterized by 1) a 1 × κ vector π giving the distribution of states at the root; and 2) for each edge e directed away from the root, a κ × κ Markov matrix M_e giving the conditional probabilities of state transitions along e. These determine the expected site pattern frequency array, or joint distribution of states at the leaves, which we view as an n-fold κ × κ × · · · × κ array or tensor, P. Each site in an alignment is modeled as independent and identically distributed according to P.

A subclass of general Markov models is composed of the general time-reversible models (GTR). For a GTR model, there is a single underlying rate matrix Q, and for each edge e of T_ρ a length t_e with M_e = exp(Q t_e). Time-reversibility is the assumption that for some symmetric κ × κ matrix R of non-negative exchangeabilities and the root distribution π, the off-diagonal entries of Q are those of the product R diag(π), with the diagonal entries chosen so that row sums are zero. This results in diag(π) Q = Q^T diag(π). One consequence of time-reversibility is that the Markov matrix M_e is independent of the direction of e. It follows that the tree parameter in a GTR model is de facto unrooted, since the location of the root is not identifiable. We repeatedly take advantage of this to 'move the root' to locations in T convenient for our arguments.

Profile mixture models are finite mixtures of GTR models, where the underlying exchangeability matrix R is the same for each class. The particular profile mixture model examined here has parameters as follows.
Definition 2.1.
Let T be a rooted topological tree, κ ≥ 2 a number of states, and m ≥ 1 a number of classes. Then the numerical parameters of the Profile Mixture Model on T, PM = PM(T, κ, m), are:

(1) a collection of non-negative branch lengths {t_e}, one for each edge e of T;
(2) a symmetric κ × κ matrix R of non-negative exchangeabilities;
(3) a collection of m class weights {w_i}, with w_i > 0 and Σ w_i = 1; and
(4) for each class i = 1, 2, . . . , m,
    − a 1 × κ root distribution vector π_i, called a profile; and
    − a scalar rate parameter r_i ≥ 0.

The scalar rate parameters {r_i} are used to incorporate across-site rate variation into the PM model. Specifically, for class i with Q_i the rate matrix determined by R, π_i, the Markov matrix on edge e in T is M_{e,i} = exp(r_i Q_i t_e). We note that site rate variation for PM models may be implemented differently in software, with a rate for each site [Lartillot and Philippe, 2004] or with a discrete-Γ(4) [Le et al., 2008]. In the first implementation, the PM model is very likely overparameterized, and ideally the MCMC would limit the number of rate multipliers. Implementation of rate variation using a discrete-Γ has a long history in computational phylogenetics [Yang, 1994], but proofs of such rate variation identifiability are only known for the continuous Γ [Allman et al., 2008, Chai and Housworth, 2011].

While probability distributions from mixture models are often described as weighted sums of distributions from the various classes, phylogenetic mixture models can be equivalently presented as a single model on a tree T with mκ states at internal nodes of T, and κ states at the leaves. The internal states are pairs (i, j) where i is a class and j ∈ [κ] is a 'usual' state. In this formulation, Markov matrices on internal edges e for the PM model are mκ × mκ block diagonal matrices, where the m blocks are the M_{e,i}, i = 1, . . . , m.
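This single-Markov-model reformulation can be sketched concretely. The following is a minimal numpy illustration (not from the paper) with made-up sizes κ = 3 and m = 2; arbitrary row-stochastic matrices stand in for the class matrices exp(r_i Q_i t_e). The internal-edge matrix is block diagonal in the class matrices, while on a terminal edge, where class information is hidden, the class matrices are stacked:

```python
import numpy as np

kappa, m = 3, 2  # small stand-ins for kappa = 20 and the number of classes
rng = np.random.default_rng(0)

def random_markov(n):
    # an arbitrary row-stochastic matrix, standing in for exp(r_i Q_i t_e)
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

M_classes = [random_markov(kappa) for _ in range(m)]  # M_{e,1}, ..., M_{e,m}

# internal edge: m*kappa x m*kappa block-diagonal matrix; no class switching
internal = np.zeros((m * kappa, m * kappa))
for i, Mi in enumerate(M_classes):
    internal[i * kappa:(i + 1) * kappa, i * kappa:(i + 1) * kappa] = Mi

# terminal edge: m*kappa x kappa stacked matrix; class is hidden at the leaf
terminal = np.vstack(M_classes)

print(internal.shape, terminal.shape)  # (6, 6) (6, 3)
# both remain Markov matrices: every row sums to 1
print(np.allclose(internal.sum(axis=1), 1), np.allclose(terminal.sum(axis=1), 1))
```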
The block structure prevents changes from one class to another, though the 'usual' states may change within the class. For the terminal edges e of T, leading to leaves where the class information is not observable, the PM Markov matrix for an edge is formed by stacking the m Markov matrices M_{e,i} for the classes. The root distribution is an mκ vector formed by concatenating the w_i π_i for the classes.

We collect these observations for parameterizing the PM model on a tree.

Definition 2.2.
Given parameters for the profile mixture model
PM(T, κ, m), assume that T is rooted at r. Then the 1 × mκ vector Π = Π_r = (w_1 π_1, w_2 π_2, . . . , w_m π_m), the mκ × mκ matrices M_e = exp(Q t_e), where Q is block diagonal with blocks r_i Q_i, for each internal edge e of length t_e, and the mκ × κ matrices M_e formed by stacking the matrices M_{e,i} for each class i, for each terminal edge, give a parameterization of the PM model as a Markov model of site substitution on T.

Since our main goal is to prove parameter identifiability for the PM model, we formally define the notion of generic identifiability.
Definition 2.3.
Consider a parametric model, specified by a parameterization map φ from some parameter space to a space of probability distributions. If φ is one-to-one, then the model parameters are identifiable. If φ is one-to-one except possibly on a subset of measure zero in the parameter space, then the model parameters are generically identifiable.

It is well known that for the GTR model some normalization is needed for rates and branch lengths, since Qt = (sQ)(t/s) shows rescaling all rates in Q can be offset by decreasing branch lengths. Once understood and addressed, this model overparameterization, or lack of identifiability, is of little consequence. Typically, the rate matrix Q is normalized so that branch lengths are measured in expected number of substitutions per site over the elapsed time. In the strictest sense, only the normalized variant of the GTR model has identifiable parameters, a result used in our proof of the main theorem.

Theorem 2.4.
For a single class GTR model on an unrooted metric tree, the tree topology and all numerical parameters are generically identifiable, up to a normalization of Q.

3. Algebraic Definitions and Lemmas
In this section we collect algebraic definitions and theorems that will play a role in our analysis of the PM model. We present these in a purely algebraic setting, deferring the connection to the phylogenetic models, and in particular the PM model, to later sections. We begin by defining tensors and certain algebraic operations on them, leading up to a theorem of J. Kruskal on the structure of 3-way tensors, an important tool that we will use several times. We then briefly introduce algebraic varieties and conclude by stating a theorem for identifying generic properties, a tool also used repeatedly in our proofs.

3.1. Tensors.

Our first definition is a standard one.
Definition 3.1.
Let A be an m × k matrix and B be an n × l matrix. The tensor, or Kronecker, product A ⊗ B is the mn × kl matrix whose rows are indexed by ordered pairs (i_1, j_1), i_1 ∈ [m], j_1 ∈ [n], and whose columns are indexed by ordered pairs (i_2, j_2), i_2 ∈ [k], j_2 ∈ [l], such that the ((i_1, j_1), (i_2, j_2)) entry is

(A ⊗ B)_{(i_1,j_1),(i_2,j_2)} = a_{i_1 i_2} b_{j_1 j_2}.

Less standard is the following.
Definition 3.2.
Let A be an m × c_1 matrix and B be an m × c_2 matrix. The row tensor product A ⊗_r B is the m × c_1 c_2 matrix with entries indexed by (i, (j, k)) for i ∈ [m], j ∈ [c_1], k ∈ [c_2],

(A ⊗_r B)_{i,(j,k)} = a_{ij} b_{ik}.

In the case that A = B is m × k and ℓ is a positive integer, the ℓth row-tensor power of A is the m × k^ℓ matrix A^{⊗ℓ_r} = A ⊗_r A ⊗_r · · · ⊗_r A (ℓ factors).
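Both products are easy to realize in numpy; the following sketch (an illustration only, not from the paper) checks the defining entry formula (A ⊗_r B)_{i,(j,k)} = a_{ij} b_{ik} on small matrices:

```python
import numpy as np

def row_tensor(A, B):
    # row tensor product of Definition 3.2: row i of the result is the
    # Kronecker product of row i of A with row i of B
    assert A.shape[0] == B.shape[0]
    return np.einsum('ij,ik->ijk', A, B).reshape(A.shape[0], -1)

A = np.array([[1., 2.],
              [3., 4.]])           # 2 x 2
B = np.array([[1., 0., 2.],
              [0., 1., 1.]])       # 2 x 3

print(np.kron(A, B).shape)         # (4, 6): the mn x kl Kronecker product
print(row_tensor(A, B))            # 2 x 6, with row i equal to kron(A[i], B[i])
# row 0 is [1*1, 1*0, 1*2, 2*1, 2*0, 2*2] = [1, 0, 2, 2, 0, 4]
```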
We do not specify the precise order of row and column indices in these tensor products, since for our applications it will either be clear from context, or inconsequential. In particular, we often only need results on the ranks of these products, which are independent of row and column ordering.

Since Kruskal's Theorem concerns 3-way tensors, we next describe reformatting n-way tensors into 3-way ones. Suppose P is an n-way tensor with indices labeled by X. Then a tripartition I|J|K of X is a collection of three disjoint non-empty subsets of X whose union is X, X = I ⊔ J ⊔ K. A bipartition of X, or a split, is defined similarly, with two disjoint non-empty sets whose union is X.

Definition 3.3.
Let A be an n-way κ × · · · × κ tensor with I|J a split of the index set X. Then the matrix flattening of A with respect to I|J, denoted Flat_{I|J}(A), is a κ^{|I|} × κ^{|J|} matrix. If, by permuting indices, we assume that I = {1, 2, · · · , |I|}, J = {|I| + 1, · · · , n}, then the (i, j)-entry is

(Flat_{I|J}(A))_{i,j} = A(i_1, . . . , i_{|I|}, j_1, . . . , j_{|J|}),

for i = (i_1, . . . , i_{|I|}) and j = (j_1, . . . , j_{|J|}). Similarly, for a tripartition I|J|K of X, the 3-way tensor Flat_{I|J|K}(A) is given by

(Flat_{I|J|K}(A))_{i,j,k} = A(i, j, k),

where i ∈ [κ]^{|I|}, j ∈ [κ]^{|J|}, and k ∈ [κ]^{|K|}.

Example. Suppose A is a 20 × 20 × 20 × 20 × 20 × 20 6-way tensor, and let I = {1, 2}, J = {3}, and K = {4, 5, 6}. Then Flat_{I|J|K}(A) is a 400 × 20 × 8000 tensor with, for indices i_1, . . . , i_6,

(Flat_{I|J|K}(A))_{(i_1,i_2), (i_3), (i_4,i_5,i_6)} = A(i_1, i_2, i_3, i_4, i_5, i_6).

Kruskal's theorem requires the notion of a 3-way tensor obtained as a sum of "outer products" of the rows of 3 matrices.
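The flattening in the example above is just a transpose-and-reshape; the following numpy sketch (an illustration only, with κ = 3 rather than 20 so the arrays stay small) spot-checks the index correspondence for the tripartition I = {1, 2}, J = {3}, K = {4, 5, 6}:

```python
import numpy as np

kappa, n = 3, 6
rng = np.random.default_rng(0)
A = rng.random((kappa,) * n)                 # a 6-way kappa x ... x kappa tensor

I, J, K = (0, 1), (2,), (3, 4, 5)            # 0-based tripartition of the indices
flat = A.transpose(I + J + K).reshape(
    kappa ** len(I), kappa ** len(J), kappa ** len(K))

print(flat.shape)                            # (9, 3, 27)

# grouped indices: (i1,i2) -> i1*kappa + i2, (i4,i5,i6) -> (i4*kappa + i5)*kappa + i6
i1, i2, i3, i4, i5, i6 = 1, 2, 0, 2, 1, 1
print(flat[i1 * kappa + i2, i3, (i4 * kappa + i5) * kappa + i6]
      == A[i1, i2, i3, i4, i5, i6])          # True
```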
Definition 3.4.
Let A be a k × n_A matrix with ith row r_i^A = (r_i^A(1), · · · , r_i^A(n_A)), and similarly for matrices B and C of size k × n_B and k × n_C respectively. Then [A, B, C] denotes the 3-way n_A × n_B × n_C tensor

[A, B, C] = Σ_{i=1}^k r_i^A ⊗ r_i^B ⊗ r_i^C,

where the tensor products in the summands are formatted to preserve an index for each matrix. For instance, r_i^A ⊗ r_i^B = (r_i^A)^T · r_i^B is n_A × n_B, where T denotes the transpose.

To illustrate, suppose that
A, B, C are 2 × 2, 2 × 3, and 2 × 4 matrices, with

A = ( 1 2
      3 4 ),   B = ( 1 2 3
                     4 5 6 ),   C = ( 1 2 3 4
                                      5 6 7 8 ).

Then P = [A, B, C] is the 2 × 3 × 4 tensor with slices for each C index given by

P(·, ·, 1) = ( 61  77  93
               82 104 126 ),   P(·, ·, 2) = (  74  94 114
                                              100 128 156 ),

P(·, ·, 3) = ( 87 111 135
              118 152 186 ),   P(·, ·, 4) = ( 100 128 156
                                              136 176 216 ).

As a simple extension of Definition 3.4 for use with phylogenetic models, we write

[π; A, B, C] = [diag(π) A, B, C] = Σ_{i=1}^k π_i r_i^A ⊗ r_i^B ⊗ r_i^C,

where π = (π_1, π_2, · · · , π_k). Before stating Kruskal's Theorem, we need the following.
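Returning to the worked example: the triple product of Definition 3.4 is a one-line einsum. Here A, B, C are the 2 × 2, 2 × 3, and 2 × 4 matrices with consecutive-integer entries, which are consistent with the displayed slices (a numerical check only, not part of the argument):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[1., 2., 3.],
              [4., 5., 6.]])
C = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.]])

# [A, B, C] = sum over i of (row i of A) x (row i of B) x (row i of C)
P = np.einsum('ij,ik,il->jkl', A, B, C)

print(P[:, :, 0])   # the slice P(., ., 1)
# [[ 61.  77.  93.]
#  [ 82. 104. 126.]]
print(P[:, :, 3])   # the slice P(., ., 4)
# [[100. 128. 156.]
#  [136. 176. 216.]]
```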
Definition 3.5.
Let A be a matrix. The Kruskal (row) rank of a matrix A is the largest number k such that every set of k rows of A is linearly independent.

For example, let V denote the set of all 3 × 3 matrices, and consider those of the form

(1)  ( a b c
       a b c
       d e f ),

where (a, b, c), (d, e, f) are independent. These matrices have rank 2 but Kruskal rank 1, and form a subset of lower dimension inside the 9-dimensional space V.

It is clear that Kruskal rank is less than or equal to matrix rank, but when a matrix has full row rank, the two notions coincide. In subsequent sections, we exploit this observation by creating matrices with full row rank and therefore full Kruskal rank.

Kruskal's theorem can be viewed as a generic identifiability theorem for 3-way arrays, showing that triple products satisfying a particular rank condition are decomposable in essentially a unique way.

Theorem 3.6 (Kruskal [1977]). Let
A, B, C be l × n_A, l × n_B, and l × n_C matrices with Kruskal ranks p, q, r respectively. If

(2)  p + q + r ≥ 2l + 2,

then A, B, C are uniquely determined by [A, B, C], up to simultaneous permutation and scaling of their rows. More precisely, if [A, B, C] = [A′, B′, C′] then there exist invertible diagonal matrices D_1, D_2 and a permutation matrix P such that

A′ = P D_1 A,   B′ = P D_2 B,   C′ = P D_1^{−1} D_2^{−1} C.

By way of contrast, note that for two compatible matrices A, B, the natural analog of the bracket product is the matrix product [A, B] = A^T B. However, from [A, B], A and B cannot be determined uniquely, since there are many matrix products that give the same result. For instance, A^T B = (QA)^T (QB) for any orthogonal matrix Q. Kruskal's theorem thus states a significant difference between matrices and 3-way tensors.

3.2. Generic points in parameter space.
Algebraic geometry provides a convenient tool for understanding exceptional sets, like those that fail to satisfy the rank conditions necessary to apply Kruskal's Theorem. We briefly give the needed definitions.
Definition 3.7.
Let S be a finite set of polynomials in C[x_1, . . . , x_n]. The common zero set in C^n of the polynomials in S is the algebraic variety V(S). A subset of a variety that is itself a variety is called a subvariety. For any algebraic variety V(S) ⊆ C^n, the ideal I(V(S)) is the set of all polynomials f ∈ C[x_1, . . . , x_n] such that f(v) = 0 for all v ∈ V(S).

The main result of this work is that PM model parameters are identifiable except for 'rare' choices. This is expressed using the following terminology.
Definition 3.8.
A property is generic on a full-dimensional subset W of R^n or C^n if it holds at all points of W except possibly for those points in some subset U ⊂ W of measure 0. If V is an algebraic variety in C^n, we say a property is generic on V if it holds at all points except those in a proper subvariety of V.

Note that proper subvarieties of varieties always have measure 0, so these notions of generic are consistent with one another.
Example. The set of 3 × 3 matrices is the variety V(S) with S = {0}. The property of having rank, or equivalently Kruskal rank, 3 is generic on V, since matrices of rank at most 2, including those of the form (1), lie in a finite union of lower-dimensional sets. This subvariety of exceptional matrices is defined by a single polynomial, the 3 × 3 determinant.

Proposition 3.9.
Let Φ : U → C^n be a complex analytic map, with U an open subset of C^ℓ. Let V be a variety in C^n. Suppose f ∈ I(V), and that there exists a point p = Φ(u_0) with f(p) ≠ 0. Then for generic points u ∈ U, or u ∈ U ∩ R^ℓ, the point Φ(u) lies off of V.

Proof. This follows from basic properties of complex analytic functions of many variables (see, for instance, the text by Range [1986]). The function f ◦ Φ is analytic, and not identically zero. Its zero set is therefore of measure zero, so for generic u ∈ U, Φ(u) lies off V(f) ⊇ V. The real points in the zero set must similarly have measure zero. □

3.3. Rank Propositions.
For the proof of our main theorem, the ranks and Kruskal ranks of some special matrices arising in the PM model are needed, and we compile these rank computations here. By giving these algebraic results in advance, the proof of Theorem 5.7 can be presented more cleanly. Note that our arguments depend in part on some computations that were performed with the software Pari/GP. As these computations were performed using exact integer arithmetic, they may be taken as valid proofs, up to the usual assumptions of correct programming and no hardware faults.

We begin by defining a particular structured matrix that can arise from particular parameter choices for the PM model.
Definition 3.10.
With a_i ∈ C for i ∈ [κ], and s = a_1 + · · · + a_κ, let M(a_1, . . . , a_κ) denote the κ × κ matrix

(3)  M(a_1, . . . , a_κ) =
     ( a_1 + 1 − s   a_2           · · ·   a_κ
       a_1           a_2 + 1 − s   · · ·   a_κ
       ...                         . . .   ...
       a_1           a_2           · · ·   a_κ + 1 − s ).

Proposition 3.11.
For κ = 20 and m ≤ 77, let M be an mκ × κ matrix formed by stacking m ≥ 1 choices of matrices of the form M(a_1, . . . , a_κ). Then M^{⊗ℓ}_r has full row rank for generic choices of the a_i when ℓ ≥ 3.

Proof. We begin with the special case of ℓ = 3. An exact Pari/GP calculation shows that for m = 77, by picking distinct random integers for a_1, . . . , a_κ for each of the m blocks in M, we may find a point p = M_0 for which M_0^{⊗3}_r has full row rank. By removing some of the blocks from this example if m < 77, we obtain a point p for which M^{⊗3}_r has full row rank for smaller m as well.

To show that full row rank is a generic condition when ℓ = 3, fix m ≤ 77, and observe that the map from the space C^{mκ} of the a_i to M is analytic. Since p = M_0 gives M_0^{⊗3}_r full row rank, there is some mκ × mκ minor f of M^{⊗3}_r which, when viewed as a polynomial in the entries of M, has f(p) ≠ 0. Taking V = V(f), Proposition 3.9 shows that generic choices of the a_i give f(M) ≠ 0, so M^{⊗3}_r has rank mκ.

Now consider ℓ > 3. Then M^{⊗ℓ}_r = M^{⊗3}_r ⊗_r M^{⊗(ℓ−3)}_r, where M^{⊗3}_r = (µ_ij) is an mκ × κ^3 matrix and M^{⊗(ℓ−3)}_r = (α_kl) is an mκ × κ^{ℓ−3} matrix. Since M^{⊗3}_r has full row rank mκ for generic M, its rows are independent. But, with v = mκ,

M^{⊗3}_r ⊗_r M^{⊗(ℓ−3)}_r =
( µ_{11} α_{11}   µ_{12} α_{11}   · · ·   µ_{1κ^3} α_{11}   · · ·
  µ_{21} α_{21}   µ_{22} α_{21}   · · ·   µ_{2κ^3} α_{21}   · · ·
  ...                             . . .   ...
  µ_{v1} α_{v1}   µ_{v2} α_{v1}   · · ·   µ_{vκ^3} α_{v1}   · · · ),

so it is enough to know that the entries of some single column of M^{⊗(ℓ−3)}_r are nonzero, and that M^{⊗3}_r has independent rows, to ensure M^{⊗ℓ}_r has independent rows. But this is true for generic choices of parameters for M. □

The next proposition gives a lower bound on Kruskal row rank, valid for all M^{⊗ℓ}_r.

Proposition 3.12.
For κ ≥ 2, let M be an mκ × κ matrix formed by stacking m ≥ 1 choices of matrices of the form M(a_1, . . . , a_κ). For ℓ ≥ 1, M^{⊗ℓ}_r has Kruskal row rank greater than or equal to 2 for generic choices of the a_i.

Proof. Consider first the case that ℓ = 1. The matrices of Kruskal rank at most 1 form an algebraic variety V. By Proposition 3.9, it is enough to find a single matrix M not in V to see that generically such matrices have Kruskal rank at least two. Choose mκ distinct positive small numbers as the free entries a_1, . . . , a_κ in each block of M, so that the diagonal entries are the largest in the block. Then no two rows within any block M(a_1, . . . , a_κ) are multiples of each other, and no two rows of different blocks are multiples either, since the a_i's are distinct. Thus M has Kruskal rank greater than or equal to two. The case when ℓ > 1 is similar. □

The final propositions in this section involve generic ranks of stacked matrices formed by taking certain tensor products of matrices of the form above.
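Before stating them, the flavor of these rank computations can be seen at a much smaller size than the exact κ = 20 Pari/GP runs. The sketch below (an illustration only, with κ = 4, m = 2, and random real entries in place of random integers) stacks two blocks of the form M(a_1, . . . , a_κ), obtaining a matrix of rank only κ whose third row-tensor power attains full row rank mκ:

```python
import numpy as np

def M_block(a):
    # the structured matrix M(a_1, ..., a_kappa) of Definition 3.10:
    # (j, k) entry a_k, plus an extra 1 - s on the diagonal, s = sum(a)
    a = np.asarray(a, dtype=float)
    return np.tile(a, (a.size, 1)) + (1 - a.sum()) * np.eye(a.size)

def row_tensor_power(M, ell):
    # ell-th row tensor power: row i becomes the ell-fold Kronecker power of row i
    R = M
    for _ in range(ell - 1):
        R = np.einsum('ij,ik->ijk', R, M).reshape(M.shape[0], -1)
    return R

rng = np.random.default_rng(1)
kappa, m = 4, 2
M = np.vstack([M_block(rng.uniform(0.01, 0.2, kappa)) for _ in range(m)])

print(np.linalg.matrix_rank(M))                       # 4 = kappa
print(np.linalg.matrix_rank(row_tensor_power(M, 3)))  # 8 = m*kappa: full row rank
```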
Proposition 3.13.
Let M be an mκ^2 × κ^3 matrix formed by stacking m choices of matrices of the form M(a_1, · · · , a_κ)^{⊗2}_r ⊗ M(a_1, · · · , a_κ). Then for κ = 20 and m < 77, the matrix M has rank greater than mκ for generic choices of the a_i.

Proof. A Pari/GP calculation shows that for some choice of random integers a_i, M has

(1) full row rank 400 > mκ = 20, when m = 1;
(2) full row rank 800 > mκ = 40, when m = 2;
(3) rank 1180 > mκ = 60, when m = 3; and
(4) rank 1540 > mκ = 80, when m = 4.

Furthermore, by (4), for m ≥ 5 there exists a matrix M with rank at least 1540 = 20 × 77 for some choice of the a_i's, since we may repeat some blocks. Using Proposition 3.9, the stated rank condition on M is thus generic for all m < 77. □

Proposition 3.14.
Let M_1 be of the form of M in Proposition 3.13, and let M_2 be formed by stacking m matrices of the form M(a_1, · · · , a_κ) ⊗ M(a_1, · · · , a_κ)^{⊗2}_r. Let L be an mκ^2 × mκ^2 diagonal matrix with positive entries. Then for κ = 20 and m < 74, M_2^T L M_1 has rank greater than mκ for generic choices of the a_i.

Proof. Sylvester's rank inequality gives

rank(M_2^T L M_1) ≥ rank(M_2^T) + rank(L M_1) − mκ^2.

Since M_1 and M_2 differ only by row and column permutations, they have the same rank. Moreover, rank(L M_1) = rank(M_1) since L is a diagonal matrix with positive entries. Then, by Proposition 3.13, there is a choice of a_i's so that M_2^T L M_1 has rank at least

(1) 400 + 400 − 400 = 400 > mκ = 20, when m = 1;
(2) 800 + 800 − 800 = 800 > mκ = 40, when m = 2;
(3) 1180 + 1180 − 1200 = 1160 > mκ = 60, when m = 3; and
(4) 1540 + 1540 − 1600 = 1480 > mκ = 80, when m = 4.

The rank computation for m = 4 shows additionally that there exist choices of a_i giving rank(M_2^T L M_1) ≥ 1480 for larger m, since blocks can be repeated. But 1480 = 20 × 74, so by Proposition 3.9, generically the rank must be greater than mκ for all m < 74. □

4. Algebraic Aspects of the Profile Mixture Model
Next we relate the algebraic definitions made in the previous section to phylogenetic models, and the PM model in particular. We begin by describing how a row tensor product of Markov matrices relates to parameters on a star tree.
Definition 4.1.
Let A be a set of taxa on a star tree rooted at its internal node, with pendant edges e_1, . . . , e_{|A|} and associated Markov matrices M_{e_i}. Then

(4)  M_A = M_{e_1} ⊗_r · · · ⊗_r M_{e_{|A|}}.

For an m-class PM model on a star tree, the matrix M_A is of size mκ × κ^{|A|}. Its entries are conditional probabilities of observing different |A|-tuples of states at the taxa in the set A, given the state at the root.

Given a tree T on taxa X, tripartitions and splits of X can be associated to the topological structure of T. For instance, the tree of Figure 1 displays a tripartition A|B|C with A = {a, b, c}, B = {d, f}, C = {g, h}. Formally, a tripartition A|B|C is displayed on a tree if there is some vertex v of T whose deletion results in three subtrees with A, B, C labeling their leaves. Similarly, if A′ = {a, b, c} and B′ = {d, f, g, h}, then X = A′ ⊔ B′, and T displays the split A′|B′ of X, since there is an edge e whose deletion results in two subtrees with leaves labeled by A′ and B′.

Figure 1. A tree displaying the tripartition A|B|C and the split A|B ∪ C, where A = {a, b, c}, B = {d, f}, C = {g, h}.

When a tree T displays a tripartition of a set of taxa, then the flattening of a joint distribution corresponding to that tripartition can be expressed using the 3-way matrix product of certain matrices built from model parameters.

Lemma 4.2.
Suppose T is a tree on a set of taxa X, rooted at an internal vertex v, and that T displays the tripartition A|B|C associated to v. Let P be a probability distribution for a Markov model M on T with ℓ states at the internal nodes. Then there exist matrices M̃_A, M̃_B, M̃_C constructed from model parameters for M, each with ℓ rows, such that Flat_{A|B|C}(P) = [M̃_A, M̃_B, M̃_C].

Proof. From the parameters on T we may define Markov matrices M_A, M_B, M_C whose entries are conditional probabilities of states at the leaves in each set A, B, C, given the state at v. Let π be the state distribution at v. Then

Flat_{A|B|C}(P) = [π; M_A, M_B, M_C] = [M̃_A, M̃_B, M̃_C],

where M̃_A = diag(π) M_A, M̃_B = M_B, and M̃_C = M_C. □

For establishing generic properties of the PM model, we will often consider the particular choice of exchangeabilities given by the matrix R = 𝟙 whose entries are all 1. This is in essence the CAT-F81 model [Lartillot and Philippe, 2004, Le et al., 2008], with the number of profiles some fixed m. For this R, a Markov matrix has the form given in equation (3) of Definition 3.10.

Lemma 4.3.
Consider the PM model
PM(T, κ, m) with R = 𝟙, and let e be a branch of T of length 1. Then for a single class c with profile π and rate r ≥ 0, the Markov matrix M_{e,c} = exp(Q_c r) for e is of the form M(a_1, . . . , a_κ) of Definition 3.10, with a_i = π_i (1 − e^{−r}) ≥ 0 and s = Σ_{i=1}^κ a_i satisfying 0 ≤ s < 1.

Conversely, any κ × κ Markov matrix of the form M = M(a_1, . . . , a_κ) with a_j ≥ 0 and 0 ≤ s < 1 comes from a choice of parameters for one class of the PM model with R = 𝟙 on an edge of length 1. Provided s ≠ 0 (equivalently, r ≠ 0), this correspondence is one-to-one.

Proof. The first statement follows by direct computation: With e_j the standard basis vectors, Q_c = R diag(π) − I has right eigenvectors −π_j e_1 + π_1 e_j with eigenvalue −1 for 2 ≤ j ≤ κ, and eigenvector Σ_{j=1}^κ e_j with eigenvalue 0.

For the converse, since 0 ≤ s < 1, there is a unique r ≥ 0 with s = 1 − e^{−r}. If s > 0, let π_j = a_j / s for j = 1, · · · , κ, and π = (π_j). Then Σ_{j=1}^κ π_j = 1, and a_j = π_j (1 − e^{−r}). With these choices Q = R diag(π) − I, and M = exp(r Q). If s = 0, then all the a_j are zero, and M is the identity matrix. Take r = 0 and π arbitrary. Then M = exp(0 · Q). □

5. Identifiability of Parameters for the Profile Mixture Model
With preliminaries completed, we now turn to establishing our main result on generic parameter identifiability for the PM model. The first step is to understand that the rank of a matrix flattening of a model distribution is affected by whether the associated split is, or is not, displayed on the tree T.

Proposition 5.1.
Let T be an n-taxon tree on X and P a distribution from the model PM = PM(T, κ, m) with κ = 20 and m < 74. Suppose that A | B is a split of X with |A|, |B| ≥ 3.

(1) If A | B is displayed on T, then Flat_{A|B}(P) has rank at most mκ;
(2) If A | B is not displayed on T, then Flat_{A|B}(P) generically has rank greater than mκ.

Before beginning the proof, we present a simplified example to illustrate how the matrix rank of flattenings of joint distributions from Markov models on trees carries information about the absence or presence of an internal edge of T.

Example. Consider a single-class 2-state Markov model on the 4-taxon tree shown in Figure 2. A special case of this model is PM(T, 2, 1). The joint distribution at the leaves is the 2 × 2 × 2 × 2 tensor P, with entries p_{ijkl} indexed by states at the leaves in the order a, b, c, d.

Figure 2. A 4-taxon tree with split {a, b} | {c, d}.

With A = {a, b} and B = {c, d}, the rows and columns of Flat_{A|B}(P) are indexed by elements of [2] × [2]: the ((i, j), (k, l)) entry is p_{ijkl}. In contrast, if A′ = {a, c} and B′ = {b, d}, the flattening Flat_{A′|B′}(P) has ((i, k), (j, l)) entry p_{ijkl}.

Now suppose that the terminal edges of T have length 0, so that the states at a and b must agree, as must those at c and d, since no substitutions occur on terminal edges. Then the matrix Flat_{A|B}(P) arises from the joint distribution of states at the internal nodes v_1 and v_2, and its only non-zero entries are the p_{iijj}. Thus the matrix flattening for the split A | B displayed by T has form

Flat_{A|B}(P) =
              (1,1)     (1,2)   (2,1)     (2,2)
    (1,1) [ p_{1111}      0       0     p_{1122} ]
    (1,2) [     0         0       0         0    ]
    (2,1) [     0         0       0         0    ]
    (2,2) [ p_{2211}      0       0     p_{2222} ]

with rank at most 2 = mκ. In contrast, the flattening for the split A′ | B′ not displayed on T has form

Flat_{A′|B′}(P) =
              (1,1)     (1,2)     (2,1)     (2,2)
    (1,1) [ p_{1111}      0         0         0    ]
    (1,2) [     0     p_{1122}      0         0    ]
    (2,1) [     0         0     p_{2211}      0    ]
    (2,2) [     0         0         0     p_{2222} ]

which generically has rank 4 = mκ^2 > mκ.

If the terminal edges of T are of positive length, then the resulting joint distribution P can be obtained by a simple and generically rank-preserving linear action on the rows and columns of the flattenings above. Thus, flattenings respecting the topology of T generically have rank mκ while those that do not generically have larger rank.

Proof of Proposition 5.1.
To show claim (1), suppose the split A | B is displayed on T, with associated edge e = (v_A, v_B). Let M_A be the mκ × κ^{|A|} matrix and M_B the mκ × κ^{|B|} matrix giving the conditional probabilities of jointly observing states at A and B, conditioned on states at v_A and v_B respectively. Then, rooting the tree at v_A and letting M_e denote the mκ × mκ Markov matrix associated to e, the joint distribution of (v_A, v_B) is diag(Π) M_e, and it follows that

Flat_{A|B}(P) = M_A^T diag(Π) M_e M_B.

Since rank(M_e) ≤ mκ, it follows that Flat_{A|B}(P) has rank at most mκ.

For claim (2), suppose now A | B is not displayed on T. Let V be the variety of matrices of size κ^{|A|} × κ^{|B|} with rank at most mκ, defined by the set of all (mκ + 1) × (mκ + 1) minors. By Proposition 3.9, it suffices to find a single choice of PM(T, κ, m) parameters that produces a point off V, as the parameterization extends to a complex analytic function.

Since T does not display A | B, by Theorem 3.8.6 of Semple and Steel [2003] there is an edge e = (v_1, v_2) of T with associated split C | D such that A′ = A ∩ C, A′′ = A ∩ D, B′ = B ∩ C, B′′ = B ∩ D are all non-empty. To find the needed choice of parameters, fix all internal edges of T except e to have length 0, so the Markov matrices on these edges are I, and fix the edge lengths of all terminal edges and e to be 1. See Figure 3. Take R = 1 and the mixing weights w_i = 1/m to be uniform. Values for the parameters π_i, r_i will be specified later in the argument. For this choice of parameters, T is formed by joining two star trees at the ends of e.

Figure 3. A tree T which does not display the split A | B, but displays the split C | D such that A′ = A ∩ C, A′′ = A ∩ D, B′ = B ∩ C, B′′ = B ∩ D are all non-empty.

Taking r = v_1 to be the root of T, let K = diag(Π) M_e be the mκ × mκ block diagonal matrix which is the joint distribution of classes and states at v_1 and v_2. The probabilities of observing states i, j, k, l at the leaves in A′, B′, A′′, B′′ respectively, P(i, j, k, l), are the entries of a κ^{|A′|} × κ^{|B′|} × κ^{|A′′|} × κ^{|B′′|} tensor.

Define an mκ × mκ × mκ × mκ tensor Q by

Q(i, j, k, l) = K(i, k) if i = j and k = l, and Q(i, j, k, l) = 0 otherwise.

The tensor Q is the joint distribution of states at the leaves of the tree T of Figure 3 when terminal edges have length zero and A′, B′, A′′, B′′ are single taxa. Indeed, since A | B is not displayed on T, the matrix Q̂ = Flat_{A|B}(Q) is (mκ)^2 × (mκ)^2 with entries

Q̂((i, j), (k, l)) = Q(i, k, j, l).

Since K is block diagonal, Q̂ has at most mκ^2 nonzero entries, all appearing on the diagonal, and Q̂ is generically of rank mκ^2.

To see that in the general case Flat_{A|B}(P) has a similar structure, let N_A = M_{A′} ⊗ M_{A′′} and N_B = M_{B′} ⊗ M_{B′′}, where M_{A′}, M_{A′′}, M_{B′}, M_{B′′} are given as in equation (4) of Definition 4.1. Then

(5)  Flat_{A|B}(P) = N_A^T Q̂ N_B.

Figure 4.
Trees with (a) |A′| = |B′| = 2 and |A′′| = |B′′| = 1, and (b) |A′| = |B′′| = 2 and |A′′| = |B′| = 1.

We now establish that claim (2) holds when |A| = |B| = 3, so the tree is one of those shown in Figure 4. Suppose first that |A′| = |B′| = 2 and |A′′| = |B′′| = 1, as shown for tree (a) of the figure. In this case N_A = N_B. Since Q̂ is diagonal with at most mκ^2 non-zero entries due to the block structure of K, in equation (5) we can replace Q̂ by a diagonal mκ^2 × mκ^2 matrix Q̄ by eliminating zero rows and columns. To do this, we must also replace N_A = N_B with an mκ^2 × κ^3 matrix N formed by taking tensor products of the individual class components of M_{A′} and M_{A′′} and then restacking. To be concrete, for class c the Markov matrix for a terminal edge is M_c = M(a_{c1}, ..., a_{cκ}) by Lemma 4.3, and N is formed by stacking the m matrices (M_c)^{⊗_r 2} ⊗ M_c, where ⊗_r denotes the row tensor power.

Since Q̄ is diagonal with generically positive entries, using equation (5) we have that

Flat_{A|B}(P) = (N^T Q̄^{1/2})(Q̄^{1/2} N) = Λ^T Λ, where Λ = Q̄^{1/2} N.

By the singular value decomposition, it follows that

rank(Λ^T Λ) = rank(Λ) = rank(N).

The Pari/GP calculation presented in Proposition 3.13, together with Proposition 3.9, shows that rank(N) > mκ generically, and thus for generic π_i and r_i it follows that rank(Flat_{A|B}(P)) > mκ.

Continuing with |A| = |B| = 3, suppose now that |A′| = |B′′| = 2 and |A′′| = |B′| = 1, as shown in Figure 4(b). The previous argument fails for this tree because now N_A ≠ N_B, as the tensor products defining these matrices are taken in different orders. However, a more complicated Pari/GP calculation, presented as Proposition 3.14, shows that Flat_{A|B}(P) generically has rank greater than mκ in this case.

Finally, for the general case of |A|, |B| ≥ 3, take Â to be a 3-element subset of A with at least one element from A′ and one from A′′, and similarly take B̂ to be a 3-element subset of B with at least one element from B′ and one from B′′. Let P̂ be the probability distribution for the taxa Â ∪ B̂. Since the row indices of Flat_{A|B}(P) depend on the states at the taxa in A and the column indices depend on the states at the taxa in B, marginalizing over all possible states for the taxa in A which are not in Â, and similarly for B, gives the matrix Flat_{Â|B̂}(P̂). There exist matrices J_1, J_2 which perform this marginalization on Flat_{A|B}(P):

J_1 Flat_{A|B}(P) J_2 = Flat_{Â|B̂}(P̂).

Since Flat_{Â|B̂}(P̂) generically has rank greater than mκ, and by this equation the rank of Flat_{A|B}(P) is at least that of Flat_{Â|B̂}(P̂), it follows that Flat_{A|B}(P) generically has rank greater than mκ. □

As a consequence of Proposition 5.1, from a distribution P computed from generic PM model parameters we can identify every edge in the tree for which there are at least three taxa on either side, by computing ranks of flattenings of P. In the following, we see that Proposition 5.1 also helps to identify at least one tripartition on the tree.

Proposition 5.2.
Let T be an n-taxon tree on X with n ≥ 9, and P a joint distribution from generic parameters for the model PM(T, κ, m) with κ = 20 and m < 74. Then there is at least one tripartition A | B | C displayed on T, with |A|, |B| ≥ 3, which can be identified from P.

Proof. By a lemma from Section 4, any tree T with n ≥ 9 taxa has an internal vertex v which induces a tripartition A | B | C such that two of the three components contain at least ⌈n/3⌉ leaves of T. The two edges incident to v that correspond to subsets of X with at least ⌈n/3⌉ leaves are generically identifiable by Proposition 5.1, since for n ≥ 9, ⌈n/3⌉ ≥ 3. If the third edge incident to v has 3 or more taxa in its component, it also can be identified. Thus, it remains to establish that the third edge incident to v can be identified when the number of taxa in its component is 1 or 2. Examples of such trees are illustrated for n = 9 in Figure 5.

Figure 5. Examples of 9-taxon trees with internal vertex v inducing A | B | C with |A|, |B| ≥ 3 and |C| = 1 or 2.

If the third component has only one leaf, as in Figure 5(a), the two bipartitions A ∪ {c} | B and A | B ∪ {c} are identifiable by Proposition 5.1. Together these imply that the tripartition induced by v is A | B | {c}. If the third component has two leaves, as in Figure 5(b), the two splits A ∪ {c_1, c_2} | B and A | B ∪ {c_1, c_2} are identifiable, while A ∪ {c_1} | B ∪ {c_2} and A ∪ {c_2} | B ∪ {c_1} are not displayed on T, and this can be detected by Proposition 5.1. This implies the tripartition A | B | {c_1, c_2} is on the tree. □

With a tripartition on the tree identifiable by the preceding proposition, we prepare to apply Kruskal's Theorem. Letting P be a joint distribution from PM(T, κ, m), pick an internal vertex v of T inducing such a tripartition A | B | C. Then by Lemma 4.2,

Flat_{A|B|C}(P) = [π; M_A, M_B, M_C] = [M̃_A, M_B, M_C],

where M̃_A = diag(Π) M_A. Provided the Kruskal ranks of the matrices M_A, M_B, M_C are large enough, at least generically, Kruskal's theorem can be applied. The next three lemmas establish this.

Lemma 5.3.
Consider the model PM(T, 20, m) with m < 74. If ℓ ≥ 3, then the ℓth row tensor power of the mκ × κ Markov matrix associated to a terminal edge of T has full row rank for generic parameters.

Proof. Using Proposition 3.9, it is enough to show there is a single choice of parameters for which the tensor power has full row rank. Let R = 1, and take the terminal branch lengths to be 1. Then by Lemma 4.3 the Markov matrix M_e on a terminal edge has the form of stacked matrices of the form M(a_1, ..., a_κ). By the Pari/GP calculation of Proposition 3.11, for generic choices of the other parameters, the row tensor power M_e^{⊗_r ℓ}, ℓ ≥ 3, has full row rank. □
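The construction invoked in this proof can be replayed numerically on a small scale: build the stacked per-class matrices M(a_1, ..., a_κ) of Lemma 4.3 via a matrix exponential, and check that a row tensor power has full row rank. A minimal sketch, assuming NumPy and with κ and m far smaller than the κ = 20, m < 74 of the results (the homemade Taylor-series exponential is an illustration, not the paper's Pari/GP computation):

```python
import numpy as np

def expm_taylor(A, terms=40):
    """Matrix exponential via truncated Taylor series (adequate for small ||A||)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

rng = np.random.default_rng(0)
m, kappa, ell = 3, 4, 3            # small stand-ins for m < 74, kappa = 20

# Stack the per-class matrices of Lemma 4.3: for class c,
# Q_c = R diag(pi_c) - I with R the all-ones matrix, and M_c = exp(r_c Q_c).
blocks = []
for _ in range(m):
    pi_c = rng.dirichlet(np.ones(kappa))
    r_c = rng.uniform(0.2, 2.0)
    Q_c = np.outer(np.ones(kappa), pi_c) - np.eye(kappa)
    M_c = expm_taylor(r_c * Q_c)
    # Sanity check of Lemma 4.3: column j of M_c is a_j = pi_j(1 - e^{-r})
    # off the diagonal, with e^{-r} added on the diagonal.
    a = pi_c * (1.0 - np.exp(-r_c))
    assert np.allclose(M_c, np.exp(-r_c) * np.eye(kappa) + np.outer(np.ones(kappa), a))
    blocks.append(M_c)
Me = np.vstack(blocks)             # the m*kappa x kappa terminal-edge Markov matrix

# ell-th row tensor power: row i of N is the ell-fold tensor product of row i of Me
N = Me
for _ in range(ell - 1):
    N = np.einsum('ij,ik->ijk', N, Me).reshape(m * kappa, -1)

assert N.shape == (m * kappa, kappa ** ell)
assert np.linalg.matrix_rank(N) == m * kappa   # full row rank, as in Lemma 5.3
```

The full row rank of N is exactly the generic condition the lemma asserts; for the actual model sizes the verification requires the exact-arithmetic computations cited above.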
Using Proposition 3.12 in a similar argument we obtain the following.
Lemma 5.4.
Consider the model PM(T, κ, m) with κ ≥ 2 and m ≥ 1. Then for ℓ ≥ 1, the ℓth row tensor power of the mκ × κ Markov matrix associated to a terminal edge of T generically has Kruskal rank at least 2.

Lemma 5.5.
For a distribution from the model PM(T, κ, m) with κ = 20 and m < 74, let M_A, M_B, M_C be the matrices described above. If |A|, |B| ≥ 3 and |C| ≥ 1, then generically M_A and M_B have full Kruskal rank and M_C has Kruskal rank at least 2.

Proof. Using Proposition 3.9, we need only show there is a single choice of parameters for which these rank claims hold. Set all internal branch lengths to 0 and all terminal branch lengths to 1, so that T is a star tree rooted at the central node v. Then by Lemma 5.3, since |A|, |B| ≥ 3, for generic parameters the matrices M_A (and therefore M̃_A) and M_B have full row rank, and therefore full Kruskal rank. Also, by Lemma 5.4, M_C has Kruskal rank at least 2. □

We add the last ingredient before the main result.
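Before doing so, note that Kruskal rank, unlike ordinary rank, has no simple linear-algebra shortcut; on small examples it can be computed by brute force over row subsets. A minimal sketch assuming NumPy, with a generic random stochastic matrix standing in for the structured matrices of Lemma 5.5:

```python
from itertools import combinations

import numpy as np

def kruskal_rank(M, tol=1e-10):
    """Largest k such that every set of k rows of M is linearly independent."""
    n = M.shape[0]
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            if np.linalg.matrix_rank(M[list(rows)], tol=tol) < k:
                return k - 1
    return n

rng = np.random.default_rng(7)
M = rng.dirichlet(np.ones(5), size=4)   # generic 4 x 5 stochastic matrix
assert kruskal_rank(M) == 4             # generic rows: full Kruskal rank

# A repeated row drops the Kruskal rank to 1, though ordinary rank stays 4
M2 = np.vstack([M, M[0]])
assert kruskal_rank(M2) == 1
```

The exhaustive subset search is exponential in the number of rows, which is why the paper establishes the needed Kruskal rank bounds theoretically (via full row rank, which implies full Kruskal rank) rather than computationally.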
Proposition 5.6.
Suppose T is a tree on X which displays a known tripartition A | B | C corresponding to a vertex r, with |A|, |B| ≥ 3 and |C| ≥ 1. If κ = 20 and m < 74, then both T and the numerical parameters of the PM(T, κ, m) model are generically identifiable, up to arbitrary rescaling of the tree and the exchangeability matrix R.

Proof. Using the notation and result of Lemma 5.5, if a distribution P comes from generic parameters of PM(T, κ, m), then

Flat_{A|B|C}(P) = [M̃_A, M_B, M_C],

where M̃_A, M_B have full Kruskal rank and M_C has Kruskal rank at least 2. Thus equation (2) of Theorem 3.6 is satisfied with l = mκ, and M̃_A, M_B, M_C are determined uniquely up to simultaneous permutation and scaling of the rows.

Also, by factoring out row sums from the matrices, we can generically identify the root distribution vector Π at the node r, and M_A, M_B, M_C up to simultaneous permutation of the entries of Π and the rows of the matrices. Considering any entry of Π, and supposing that it corresponds to an unknown class u ∈ [m] and state w ∈ [κ], the same rows of M_A, M_B, M_C correspond to the same class u and state w. Since Kruskal's theorem yields identifiability only up to permutation, we must determine which of the mκ rows of M_A, M_B, M_C correspond to the same fixed class u.

Consider first the special case that |A| = 3, with A = {a, b, c}. Then T, which is generically binary, has a subtree rooted at r with leaves A = {x, y, z}, as shown in Figure 6, though we do not know which two taxa from a, b, c form the cherry {y, z}.

Figure 6. A subtree of T with leaves A = {a, b, c} = {x, y, z}.

The Markov matrix M_A is of size mκ × κ^3. Choose the ℓth row of M_A, where ℓ = (u, w) for unknown u, w. It is a row vector with κ^3 entries, but we can reconfigure it as a 3-dimensional tensor of size κ × κ × κ so that its (i, j, k)-entry is P(a = i, b = j, c = k | r = ℓ). Since the PM model is time reversible, take v_1 as the root of the subtree in Figure 6. Then for an unknown 1 × κ vector π_v, and κ × κ Markov matrices M_x, M_y, M_z, M_1, M_2 for class u on this subtree, the joint distribution of states at x, y, z, r for the fixed class u is

P(x = i, y = j, z = k, r = (u, w))
  = Σ_{α=1}^κ Σ_{β=1}^κ π_v(β) M_y(β, j) M_z(β, k) M_1(β, α) M_2(α, w) M_x(α, i)
  = Σ_{β=1}^κ π_v(β) M_y(β, j) M_z(β, k) ( Σ_{α=1}^κ M_1(β, α) M_2(α, w) M_x(α, i) )
  = Σ_{β=1}^κ π_v(β) M_y(β, j) M_z(β, k) M̂_{(u,w)}(β, i)
  = [π_v; M_y, M_z, M̂_{(u,w)}],

where M̂_{(u,w)} = M_1 diag(M_2(·, w)) M_x, with M_2(·, w) denoting the wth column of M_2. For fixed u this is simply a rescaling of the conditional distribution P(x = i, y = j, z = k | r = (u, w)) given in the ℓth row of M_A.

Thus, applying Kruskal's theorem to each row of M_A reshaped into such a 3-way tensor, we can decompose P(x = i, y = j, z = k | r = ℓ) for each ℓ = (u, w) into a triple product, as the matrices generically all have rank κ. Note that for each ℓ = (u, w), Kruskal's theorem gives the matrices M_y, M_z, M̂_{(u,w)} up to ordering of their κ rows. Two of these matrices, M_y and M_z, depend only on the class u, and not on the state w. So considering all ℓ = (u, w), we can find κ rows of M_A with the same (possibly row-permuted) versions of M_y and M_z, and these correspond to a single class u. In this way we can group the rows of M_A, M_B, M_C, with entries of Π, by class u.
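The collapsing step above, replacing the inner sum over α by the single matrix M̂_{(u,w)} = M_1 diag(M_2(·, w)) M_x, can be verified numerically. A sketch assuming NumPy, with random stochastic matrices standing in for actual model parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
k = 4                                    # small stand-in for kappa = 20
pi_v = rng.dirichlet(np.ones(k))         # distribution at the subtree root v_1
Mx, My, Mz, M1, M2 = (rng.dirichlet(np.ones(k), size=k) for _ in range(5))
w = 2                                    # a fixed state at the vertex r

# Direct double sum over internal states beta (at v_1) and alpha (at v_2)
T_direct = np.einsum('b,bj,bk,ba,a,ai->ijk',
                     pi_v, My, Mz, M1, M2[:, w], Mx)

# Collapsed triple-product form [pi_v; My, Mz, Mhat]
Mhat = M1 @ np.diag(M2[:, w]) @ Mx
T_collapsed = np.einsum('b,bj,bk,bi->ijk', pi_v, My, Mz, Mhat)

assert np.allclose(T_direct, T_collapsed)
```

The collapsed form is exactly the shape of tensor to which Kruskal's theorem applies, with the three factor matrices π_v-weighted M_y, M_z, and M̂_{(u,w)}.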
Now taking those rows of M_A, M_B, M_C and entries of Π for one class u, and reassembling them into a 3-way product, gives a tensor for a single-class GTR model on the full tree T. Both the tree T and the numerical parameters are identifiable for this single-class model by Theorem 2.4.

For the general case, suppose |A|, |B| ≥ 3. Then by marginalization down to |A| = 3 we can identify the subtrees and parameters for B, C. Then interchanging the roles of A and B identifies the subtree and parameters for A. □

Combining Proposition 5.2 with Proposition 5.6, we have proved the main result.
Theorem 5.7.
Let T be a tree with at least 9 taxa. Then under the PM(T, 20, m) model with m < 74, both T and the numerical parameters are generically identifiable, up to arbitrary rescaling of the tree and the exchangeability matrix R.

Theorem 5.7 extends to certain tree shapes with fewer than 9 taxa. To apply Proposition 5.6, T must display a tripartition with two of its subsets of size at least 3, so T must have at least 7 taxa. Such a tripartition will be generically identifiable by the argument given for Proposition 5.2.

Corollary 5.8.
For the profile mixture model PM(T, 20, m) with m < 74, parameters are generically identifiable if T has any of the 8-taxon tree shapes (a)-(d) shown in Figure 7, or the 7-taxon caterpillar shape.

Figure 7. All binary unrooted tree shapes for 8 taxa. Parameters of the PM model are generically identifiable for trees (a)-(d). The arguments of this paper do not answer the identifiability question for tree (e).
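The rank dichotomy of Proposition 5.1, which drives all of the identifiability arguments above, can be replayed in miniature for the 2-state quartet example of Figure 2. A sketch assuming NumPy, with random parameters standing in for generic ones (here m = 1 and κ = 2, so mκ = 2):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 2                                     # two states, single class

def random_markov(k):
    return rng.dirichlet(np.ones(k), size=k)   # rows sum to 1

pi = rng.dirichlet(np.ones(k))            # root distribution at v_1
Me = random_markov(k)                     # internal edge v_1 -> v_2
Ma, Mb, Mc, Md = (random_markov(k) for _ in range(4))

# Joint distribution p_{ijkl} at leaves a, b, c, d of the quartet ab|cd
P = np.einsum('x,xy,xi,xj,yk,yl->ijkl', pi, Me, Ma, Mb, Mc, Md)

flat_ab_cd = P.reshape(k * k, k * k)                        # split displayed on T
flat_ac_bd = P.transpose(0, 2, 1, 3).reshape(k * k, k * k)  # split not displayed

assert np.linalg.matrix_rank(flat_ab_cd) == 2   # rank m*kappa for the true split
assert np.linalg.matrix_rank(flat_ac_bd) == 4   # generically larger rank
```

Comparing flattening ranks across candidate splits is thus, in principle, enough to recover every edge of the tree with at least three taxa on each side, as in the proofs above.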
Acknowledgments
This research was supported, in part, by the National Institutes of Health Grant R01GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at theInterface of the Biological and Mathematical Sciences.
Author Disclosure Statement
No competing financial interests exist.
References
E.S. Allman and J.A. Rhodes. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J. Comput. Biol., 13:1101–1113, 2006. doi: 10.1007/s00285-010-0355-7.
E.S. Allman and J.A. Rhodes. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci., 211(1):18–33, 2008.
E.S. Allman and J.A. Rhodes. The identifiability of covarion models in phylogenetics. IEEE/ACM Trans. Comput. Biol. Bioinform., 6(1):76–88, 2009.
E.S. Allman, C. Ané, and J.A. Rhodes. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. in Appl. Probab., 40:229–249, 2008.
E.S. Allman, M.T. Holder, and J.A. Rhodes. Estimating trees from filtered data: Identifiability of models for morphological phylogenetics. J. Theor. Biol., 263:108–119, 2010.
E.S. Allman, S. Petrović, J.A. Rhodes, and S. Sullivant. Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans. Comput. Biol. Bioinform., 8(3):710–722, 2011.
E.S. Allman, C. Long, and J.A. Rhodes. Species tree inference from genomic sequences using the log-det distance. SIAM J. Appl. Algebra Geometry, 3(1):1–30, 2019.
J. Chai and E.A. Housworth. On Rogers's proof of identifiability for the GTR + Gamma + I model. Syst. Biol., 60(5):713–718, 2011.
J. Chifman and L. Kubatko. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J. Theor. Biol., 374:35–47, 2015.
B. Hollering and S. Sullivant. Identifiability in phylogenetics using algebraic matroids. arXiv:1909.13754, 2019.
D.T. Jones, W.R. Taylor, and J.M. Thornton. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 8(3):275–282, 1992. doi: 10.1093/bioinformatics/8.3.275.
J.B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95–138, 1977. doi: 10.1016/0024-3795(77)90069-6.
N. Lartillot and H. Philippe. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol., 21:1095–1109, 2004.
N. Lartillot, H. Brinkmann, and H. Philippe. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol. Biol., 7:S4, 2007.
N. Lartillot, T. Lepage, and S. Blanquart. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics, 25:2286–2288, 2009.
N. Lartillot, N. Rodrigue, D. Stubbs, and J. Richer. PhyloBayes MPI: Phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol., 62(4):611–615, 2013.
S.Q. Le, O. Gascuel, and N. Lartillot. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics, 24(20):2317–2323, 2008. doi: 10.1093/bioinformatics/btn445.
C. Long and S. Sullivant. Identifiability of 3-class Jukes-Cantor mixtures. Adv. in Appl. Math., 64:89–110, 2015. doi: 10.1016/j.aam.2014.12.003.
R.M. Range. Holomorphic Functions and Integral Representations in Several Complex Variables, volume 108 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1986.
B. Rannala. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst. Biol., 51(5):754–760, 2002. doi: 10.1080/10635150290102429.
J.A. Rhodes and S. Sullivant. Identifiability of large phylogenetic mixture models. Bull. Math. Biol., 74:212–231, 2012. doi: 10.1007/s11538-011-9672-2.
C. Semple and M. Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford, 2003.
The PARI Group. PARI/GP. Univ. Bordeaux, 2019. URL http://pari.math.u-bordeaux.fr/.
H.-C. Wang, K. Li, E. Susko, and A.J. Roger. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evol. Biol., 8(331):1–13, 2008. doi: 10.1186/1471-2148-8-331.
H.-C. Wang, E. Susko, and A.J. Roger. An amino acid substitution-selection model adjusts residue fitness to improve phylogenetic estimation. Mol. Biol. Evol., 31(4):779–792, 2014.
M. Wascher and L. Kubatko. Consistency of SVDQuartets and maximum likelihood for coalescent-based species tree estimation. Syst. Biol., 2020. In press.
S. Whelan and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol., 18(5):691–699, 2001. doi: 10.1093/oxfordjournals.molbev.a003851.
Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol., 39:306–314, 1994.
Department of Mathematics and Statistics, University of Alaska Fairbanks, 99775
E-mail address : [email protected] Department of Mathematics and Statistics, University of Alaska Fairbanks, 99775
E-mail address : [email protected] Department of Mathematics and Statistics, University of Alaska Fairbanks, 99775
E-mail address :