Parameter identifiability for a profile mixture model of protein evolution
SAMANEH YOURDKHANI, ELIZABETH S. ALLMAN, AND JOHN A. RHODES
Abstract.
A Profile Mixture Model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend in part on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here we show that a Profile Mixture Model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.

1. Introduction
A Profile Mixture model is a certain stochastic model of protein sequence evolution that describes the changes in sequences along the tree of evolutionary relationships of a collection of taxa. Such a model is often used for the inference of the tree from sequence data, using standard maximum likelihood or Bayesian statistical frameworks. Here we investigate the question of parameter identifiability for this model: Are the model parameters, both the tree topology and numerical ones, determined by a site pattern distribution arising from the model? Parameter identifiability, which informally means that valid parameter inference is possible in ideal circumstances, is an essential component of the theoretical justification for standard statistical inference approaches.

In models of protein sequence generation, amino acid site patterns are generally assumed to be independent and identically distributed across the sites. Common continuous-time models of amino acid substitutions are instances of the general time-reversible model (GTR), which assumes a single rate matrix Q constant over a metric tree, or extensions that allow for additional scalar rate variation at individual sites. The rate matrix Q has off-diagonal entries agreeing with those of R diag(π), where R is a symmetric matrix of exchangeabilities and π is a vector of frequencies of the amino acids which remains stable under the model.

In principle, one can infer R, π, and a metric tree of taxon relationships from protein sequence data using standard statistical frameworks. However, with 20 amino acids the state space for the model is large, so an exchangeability matrix R is often fixed in advance, having been previously determined empirically for particular types of data. Well-known exchangeabilities for protein alignments include the JTT [Jones et al., 1992], WAG [Whelan and Goldman, 2001], and LG [Le et al., 2008] matrices.

Date: June 30, 2020.
When inspecting protein sequence data, however, it is often clear that the GTR assumption of identically distributed sites is a poor one, since sites have visibly different amino acid compositions. Site residue distributions, or profiles, likely differ because of biophysical properties of amino acids (e.g., hydrophilicity, polarity, or charge), and the associated structural and functional constraints on the protein. This phenomenon suggests a model with multiple classes of substitution processes, and in particular a mixture model using a variety of profiles with the same exchangeabilities for all classes. Mixture models can provide better fit to data as they introduce more parameters, though they also increase computational time and may lead to overfitting of the data.

But a more fundamental issue with adopting a mixture model is that one may lose parameter identifiability. If several choices, or even more worrisome, infinitely many choices of parameters lead to the same probability distribution under the model, then even with an idealized infinite data set perfectly in accord with the model one could not recover the parameter values under which the data arose. Since the goal of most phylogenetic analyses is to infer model parameters, generally the topological tree but often numerical parameters as well, identifiability is an essential property for a model to be useful. Non-identifiability poses particular challenges in Bayesian MCMC analyses, where it may be manifested as a lack of convergence [Rannala, 2002].

For non-mixture site substitution models in phylogenetics parameter identifiability has long been established, but mixture models provide greater challenges. Although computational work may suggest whether it holds or fails, parameter identifiability can only be established theoretically, as it is a model property, not dependent on an inference method.
In recent years algebraic methods have been introduced and successfully applied to a number of phylogenetic mixture models; see, for example, Allman and Rhodes [2006, 2008, 2009], Allman et al. [2010, 2011, 2019], Chifman and Kubatko [2015], Long and Sullivant [2015], Hollering and Sullivant [2019], Wascher and Kubatko [2020]. While one of these works [Rhodes and Sullivant, 2012] established a rather general result on parameter identifiability of phylogenetic mixture models with many components, it unfortunately does not apply to the profile mixture model's specific structure.

In this work, we prove parameter identifiability for a Profile Mixture Model (PM) of amino acid site substitution. PM models were introduced in the Bayesian context [Lartillot and Philippe, 2004, Lartillot et al., 2009, 2013], where the number of profiles might be inferred using a Dirichlet process prior, and as finite mixtures with a fixed number of components in a Maximum Likelihood analysis [Le et al., 2008]. Studies suggest that PM models perform better than single-class models, particularly on data that is saturated or with an underlying long branch attraction bias [Lartillot et al., 2007, Wang et al., 2008]. Mixtures with as many as 60 classes have been investigated with empirical data sets, with indications that around 20 profiles often provides good fit [Le et al., 2008]. For a recent study assessing the performance under simulation of mixture models including discrete-Γ rates-across-sites and PM models, see Wang et al. [2014].

Our main result, Theorem 5.7, establishes that parameters of a profile mixture model with up to 73 classes on a tree of 9 or more taxa are generically identifiable; that is, identifiable outside an exceptional parameter set of measure zero. For any fixed number
of classes, the parameters include the tree topology, the tree's edge lengths, the exchangeabilities, the profiles, and the weights of the mixture components.

The proof techniques we employ are algebraic in nature, using ideas from tensor decomposition and algebraic geometry. These tools, which have been introduced and used previously for phylogenetic models [Allman and Rhodes, 2006, 2009, Rhodes and Sullivant, 2012], are based on the algebraic properties of matrices and 3-way tensors obtained from rearranging the entries of the distribution of site pattern frequencies. However, the structure of the PM model, with profiles varying over classes while the exchangeabilities do not, introduces important differences that prevent any easy deduction of the result from previous work. At several points in our arguments we use exact integer computation, performed by the software
Pari/GP [The PARI Group, 2019], to establish certain generic conditions we need on ranks of matrices.

As motivated by applications to amino acid models, our main theorem is stated for the profile mixture model with a state space of size 20. However, the techniques used for establishing it apply to arbitrary sizes κ of the state space. For example, κ might be 4 for DNA, or 61 for codons. However, appropriate rank computations would need to be carried out to complete the proof in such contexts. In the κ = 20 setting we also believe the proof techniques could be pushed to establish identifiability for more than 73 profiles, at the expense of requiring more taxa on the tree.

This paper is organized as follows: In Section 2 we introduce phylogenetic substitution models, and in particular the profile mixture model under study. Section 3 provides algebraic definitions and lemmas, though removed from the biological setting of interest. Section 4 then connects the phylogenetic profile mixture model with these algebraic notions. We conclude in Section 5 with the proof of our main theorem on identifiability of the PM model parameters.

2. Markov Models on Trees
We begin by introducing Markov models of site substitution along a tree. Throughout, let κ be the size of the state space, which we identify with [κ] = {1, 2, . . . , κ}. For protein data, κ = 20. Let T_ρ be a rooted topological tree, with root ρ and leaves labelled by elements of the taxon set X. The general Markov model of κ-state sequence evolution along T_ρ is parameterized by 1) a 1 × κ vector π giving the distribution of states at the root; and 2) for each edge e directed away from the root, a κ × κ Markov matrix M_e giving the conditional probabilities of state transitions along e. These determine the expected site pattern frequency array, or joint distribution of states at the leaves, which we view as an n-fold κ × κ × · · · × κ array or tensor, P. Each site in an alignment is modeled as independent and identically distributed according to P.

A subclass of general Markov models is composed of the general time-reversible models (GTR). For a GTR model, there is a single underlying rate matrix Q, and for each edge e of T_ρ a length t_e with M_e = exp(Q t_e). Time-reversibility is the assumption that for some symmetric κ × κ matrix R of non-negative exchangeabilities and the root distribution π, the off-diagonal entries of Q are those of the product R diag(π), with the diagonal entries chosen so that row sums are zero. This results in diag(π) Q = Q^T diag(π). One consequence of time-reversibility is that the Markov matrix M_e is independent of the direction of e. It follows that the tree parameter in a GTR model is de facto unrooted, since the location of the root is not identifiable. We repeatedly take advantage of this to 'move the root' to locations in T convenient for our arguments.

Profile mixture models are finite mixtures of GTR models, where the underlying exchangeability matrix R is the same for each class. The particular profile mixture model examined here has parameters as follows.
Definition 2.1.
Let T be a rooted topological tree, κ ≥ 2 a number of states, and m ≥ 1 a number of classes. Then the numerical parameters of the Profile Mixture Model on T, PM = PM(T, κ, m), are:

(1) a collection of non-negative branch lengths {t_e}, one for each edge e of T;
(2) a symmetric κ × κ matrix R of non-negative exchangeabilities;
(3) a collection of m class weights {w_i}, with w_i > 0 and Σ w_i = 1; and
(4) for each class i = 1, 2, . . . , m,
    − a 1 × κ root distribution vector π_i, called a profile; and
    − a scalar rate parameter r_i ≥ 0.

The scalar rate parameters {r_i} are used to incorporate across-site rate variation into the PM model. Specifically, for class i with Q_i the rate matrix determined by R, π_i, the Markov matrix on edge e in T is M_{e,i} = exp(r_i Q_i t_e). We note that site rate variation for PM models may be implemented differently in software, with a rate for each site [Lartillot and Philippe, 2004] or with a discrete-Γ(4) [Le et al., 2008]. In the first implementation, the PM model is very likely overparameterized, and ideally the MCMC would limit the number of rate multipliers. Implementation of rate variation using a discrete-Γ has a long history in computational phylogenetics [Yang, 1994], but proofs of such rate variation identifiability are only known for the continuous Γ [Allman et al., 2008, Chai and Housworth, 2011].

While probability distributions from mixture models are often described as weighted sums of distributions from the various classes, phylogenetic mixture models can be equivalently presented as a single model on a tree T with mκ states at internal nodes of T, and κ states at the leaves. The internal states are pairs (i, j) where i is a class and j ∈ [κ] is a 'usual' state. In this formulation, Markov matrices on internal edges e for the PM model are mκ × mκ block diagonal matrices, where the m blocks are the M_{e,i}, i = 1, . . . , m.
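This single-Markov-model reformulation can be sketched concretely. The following is a minimal numpy illustration (not from the paper) with made-up sizes κ = 3 and m = 2; arbitrary row-stochastic matrices stand in for the class matrices exp(r_i Q_i t_e). The internal-edge matrix is block diagonal in the class matrices, while on a terminal edge, where class information is hidden, the class matrices are stacked:

```python
import numpy as np

kappa, m = 3, 2  # small stand-ins for kappa = 20 and the number of classes
rng = np.random.default_rng(0)

def random_markov(n):
    # an arbitrary row-stochastic matrix, standing in for exp(r_i Q_i t_e)
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

M_classes = [random_markov(kappa) for _ in range(m)]  # M_{e,1}, ..., M_{e,m}

# internal edge: m*kappa x m*kappa block-diagonal matrix; no class switching
internal = np.zeros((m * kappa, m * kappa))
for i, Mi in enumerate(M_classes):
    internal[i * kappa:(i + 1) * kappa, i * kappa:(i + 1) * kappa] = Mi

# terminal edge: m*kappa x kappa stacked matrix; class is hidden at the leaf
terminal = np.vstack(M_classes)

print(internal.shape, terminal.shape)  # (6, 6) (6, 3)
# both remain Markov matrices: every row sums to 1
print(np.allclose(internal.sum(axis=1), 1), np.allclose(terminal.sum(axis=1), 1))
```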
The block structure prevents changes from one class to another, though the 'usual' states may change within the class. For the terminal edges e of T, leading to leaves where the class information is not observable, the PM Markov matrix for an edge is formed by stacking the m Markov matrices M_{e,i} for the classes. The root distribution is an mκ vector formed by concatenating the w_i π_i for the classes.

We collect these observations for parameterizing the PM model on a tree.

Definition 2.2.
Given parameters for the profile mixture model
PM(T, κ, m), assume that T is rooted at r. Then the 1 × mκ vector Π = Π_r = (w_1 π_1, w_2 π_2, . . . , w_m π_m), the mκ × mκ matrices M_e = exp(Q t_e), where Q is block diagonal with blocks r_i Q_i, for each internal edge e of length t_e, and the mκ × κ matrices M_e formed by stacking the matrices M_{e,i} for each class i, for each terminal edge, give a parameterization of the PM model as a Markov model of site substitution on T.

Since our main goal is to prove parameter identifiability for the PM model, we formally define the notion of generic identifiability.
Definition 2.3.
Consider a parametric model, specified by a parameterization map φ from some parameter space to a space of probability distributions. If φ is one-to-one, then the model parameters are identifiable. If φ is one-to-one except possibly on a subset of measure zero in the parameter space, then the model parameters are generically identifiable.

It is well known that for the GTR model some normalization is needed for rates and branch lengths, since Qt = (sQ)(t/s) shows rescaling all rates in Q can be offset by decreasing branch lengths. Once understood and addressed, this model overparameterization, or lack of identifiability, is of little consequence. Typically, the rate matrix Q is normalized so that branch lengths are measured in expected number of substitutions per site over the elapsed time. In the strictest sense, only the normalized variant of the GTR model has identifiable parameters, a result used in our proof of the main theorem.

Theorem 2.4.
For a single class GTR model on an unrooted metric tree, the tree topology and all numerical parameters are generically identifiable, up to a normalization of Q.

3. Algebraic Definitions and Lemmas
In this section we collect algebraic definitions and theorems that will play a role in our analysis of the PM model. We present these in a purely algebraic setting, deferring the connection to the phylogenetic models, and in particular the PM model, to later sections. We begin by defining tensors and certain algebraic operations on them, leading up to a theorem of J. Kruskal on the structure of 3-way tensors, an important tool that we will use several times. We then briefly introduce algebraic varieties and conclude by stating a theorem for identifying generic properties, a tool also used repeatedly in our proofs.

3.1. Tensors.

Our first definition is a standard one.
Definition 3.1.
Let A be an m × k matrix and B be an n × l matrix. The tensor, or Kronecker, product A ⊗ B is the mn × kl matrix whose rows are indexed by ordered pairs (i_1, j_1), i_1 ∈ [m], j_1 ∈ [n], and whose columns are indexed by ordered pairs (i_2, j_2), i_2 ∈ [k], j_2 ∈ [l], such that the ((i_1, j_1), (i_2, j_2)) entry is

(A ⊗ B)_{(i_1,j_1),(i_2,j_2)} = a_{i_1 i_2} b_{j_1 j_2}.

Less standard is the following.
Definition 3.2.
Let A be an m × c_1 matrix and B be an m × c_2 matrix. The row tensor product A ⊗_r B is the m × c_1 c_2 matrix with entries indexed by (i, (j, k)) for i ∈ [m], j ∈ [c_1], k ∈ [c_2],

(A ⊗_r B)_{i,(j,k)} = a_{ij} b_{ik}.

In the case that A = B is m × k and ℓ is a positive integer, the ℓth row-tensor power of A is the m × k^ℓ matrix A^{⊗ℓ_r} = A ⊗_r A ⊗_r · · · ⊗_r A (ℓ factors).
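Both products are easy to realize in numpy; the following sketch (an illustration only, not from the paper) checks the defining entry formula (A ⊗_r B)_{i,(j,k)} = a_{ij} b_{ik} on small matrices:

```python
import numpy as np

def row_tensor(A, B):
    # row tensor product of Definition 3.2: row i of the result is the
    # Kronecker product of row i of A with row i of B
    assert A.shape[0] == B.shape[0]
    return np.einsum('ij,ik->ijk', A, B).reshape(A.shape[0], -1)

A = np.array([[1., 2.],
              [3., 4.]])           # 2 x 2
B = np.array([[1., 0., 2.],
              [0., 1., 1.]])       # 2 x 3

print(np.kron(A, B).shape)         # (4, 6): the mn x kl Kronecker product
print(row_tensor(A, B))            # 2 x 6, with row i equal to kron(A[i], B[i])
# row 0 is [1*1, 1*0, 1*2, 2*1, 2*0, 2*2] = [1, 0, 2, 2, 0, 4]
```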
We do not specify the precise order of row and column indices in these tensor products, since for our applications it will either be clear from context, or inconsequential. In particular, we often only need results on the ranks of these products, which are independent of row and column ordering.

Since Kruskal's Theorem concerns 3-way tensors, we next describe reformatting n-way tensors into 3-way ones. Suppose P is an n-way tensor with indices labeled by X. Then a tripartition I|J|K of X is a collection of three disjoint non-empty subsets of X whose union is X, X = I ⊔ J ⊔ K. A bipartition of X, or a split, is defined similarly, with two disjoint non-empty sets whose union is X.

Definition 3.3.
Let A be an n-way κ × · · · × κ tensor with I|J a split of the index set X. Then the matrix flattening of A with respect to I|J, denoted Flat_{I|J}(A), is a κ^{|I|} × κ^{|J|} matrix. If, by permuting indices, we assume that I = {1, 2, · · · , |I|}, J = {|I| + 1, · · · , n}, then the (i, j)-entry is

(Flat_{I|J}(A))_{i,j} = A(i_1, . . . , i_{|I|}, j_1, . . . , j_{|J|}),

for i = (i_1, . . . , i_{|I|}) and j = (j_1, . . . , j_{|J|}). Similarly, for a tripartition I|J|K of X, the 3-way tensor Flat_{I|J|K}(A) is given by

(Flat_{I|J|K}(A))_{i,j,k} = A(i, j, k),

where i ∈ [κ]^{|I|}, j ∈ [κ]^{|J|}, and k ∈ [κ]^{|K|}.

Example. Suppose A is a 20 × 20 × 20 × 20 × 20 × 20 6-way tensor, and let I = {1, 2}, J = {3}, and K = {4, 5, 6}. Then Flat_{I|J|K}(A) is a 400 × 20 × 8000 tensor with, for indices i_1, . . . , i_6,

(Flat_{I|J|K}(A))_{(i_1,i_2), (i_3), (i_4,i_5,i_6)} = A(i_1, i_2, i_3, i_4, i_5, i_6).

Kruskal's theorem requires the notion of a 3-way tensor obtained as a sum of "outer products" of the rows of 3 matrices.
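The flattening in the example above is just a transpose-and-reshape; the following numpy sketch (an illustration only, with κ = 3 rather than 20 so the arrays stay small) spot-checks the index correspondence for the tripartition I = {1, 2}, J = {3}, K = {4, 5, 6}:

```python
import numpy as np

kappa, n = 3, 6
rng = np.random.default_rng(0)
A = rng.random((kappa,) * n)                 # a 6-way kappa x ... x kappa tensor

I, J, K = (0, 1), (2,), (3, 4, 5)            # 0-based tripartition of the indices
flat = A.transpose(I + J + K).reshape(
    kappa ** len(I), kappa ** len(J), kappa ** len(K))

print(flat.shape)                            # (9, 3, 27)

# grouped indices: (i1,i2) -> i1*kappa + i2, (i4,i5,i6) -> (i4*kappa + i5)*kappa + i6
i1, i2, i3, i4, i5, i6 = 1, 2, 0, 2, 1, 1
print(flat[i1 * kappa + i2, i3, (i4 * kappa + i5) * kappa + i6]
      == A[i1, i2, i3, i4, i5, i6])          # True
```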
Definition 3.4.
Let A be a k × n_A matrix with ith row r_i^A = (r_i^A(1), · · · , r_i^A(n_A)), and similarly for matrices B and C of size k × n_B and k × n_C respectively. Then [A, B, C] denotes the 3-way n_A × n_B × n_C tensor

[A, B, C] = Σ_{i=1}^k r_i^A ⊗ r_i^B ⊗ r_i^C,

where the tensor products in the summands are formatted to preserve an index for each matrix. For instance, r_i^A ⊗ r_i^B = (r_i^A)^T · r_i^B is n_A × n_B, where T denotes the transpose.

To illustrate, suppose that
A, B, C are 2 × 2, 2 × 3, and 2 × 4 matrices, with

A = ( 1 2
      3 4 ),   B = ( 1 2 3
                     4 5 6 ),   C = ( 1 2 3 4
                                      5 6 7 8 ).

Then P = [A, B, C] is the 2 × 3 × 4 tensor with slices for each C index given by

P(·, ·, 1) = ( 61  77  93
               82 104 126 ),   P(·, ·, 2) = (  74  94 114
                                              100 128 156 ),

P(·, ·, 3) = ( 87 111 135
              118 152 186 ),   P(·, ·, 4) = ( 100 128 156
                                              136 176 216 ).

As a simple extension of Definition 3.4 for use with phylogenetic models, we write

[π; A, B, C] = [diag(π) A, B, C] = Σ_{i=1}^k π_i r_i^A ⊗ r_i^B ⊗ r_i^C,

where π = (π_1, π_2, · · · , π_k). Before stating Kruskal's Theorem, we need the following.
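Returning to the worked example: the triple product of Definition 3.4 is a one-line einsum. Here A, B, C are the 2 × 2, 2 × 3, and 2 × 4 matrices with consecutive-integer entries, which are consistent with the displayed slices (a numerical check only, not part of the argument):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[1., 2., 3.],
              [4., 5., 6.]])
C = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.]])

# [A, B, C] = sum over i of (row i of A) x (row i of B) x (row i of C)
P = np.einsum('ij,ik,il->jkl', A, B, C)

print(P[:, :, 0])   # the slice P(., ., 1)
# [[ 61.  77.  93.]
#  [ 82. 104. 126.]]
print(P[:, :, 3])   # the slice P(., ., 4)
# [[100. 128. 156.]
#  [136. 176. 216.]]
```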
Definition 3.5.
Let A be a matrix. The Kruskal (row) rank of a matrix A is the largest number k such that every set of k rows of A is linearly independent.

For example, let V denote the set of all 3 × 3 matrices, and consider those of the form

(1)  ( a b c
       a b c
       d e f ),

where (a, b, c), (d, e, f) are independent. These matrices have rank 2 but Kruskal rank 1, and form a subset of lower dimension inside the 9-dimensional space V.

It is clear that Kruskal rank is less than or equal to matrix rank, but when a matrix has full row rank, the two notions coincide. In subsequent sections, we exploit this observation by creating matrices with full row rank and therefore full Kruskal rank.

Kruskal's theorem can be viewed as a generic identifiability theorem for 3-way arrays, showing that triple products satisfying a particular rank condition are decomposable in essentially a unique way.

Theorem 3.6 (Kruskal [1977]). Let
A, B, C be l × n_A, l × n_B, and l × n_C matrices with Kruskal ranks p, q, r respectively. If

(2)  p + q + r ≥ 2l + 2,

then A, B, C are uniquely determined by [A, B, C], up to simultaneous permutation and scaling of their rows. More precisely, if [A, B, C] = [A′, B′, C′] then there exist invertible diagonal matrices D_1, D_2 and a permutation matrix P such that

A′ = P D_1 A,   B′ = P D_2 B,   C′ = P D_1^{−1} D_2^{−1} C.

By way of contrast, note that for two compatible matrices A, B, the natural analog of the bracket product is the matrix product [A, B] = A^T B. However, from [A, B], A and B cannot be determined uniquely, since there are many matrix products that give the same result. For instance, A^T B = (QA)^T (QB) for any orthogonal matrix Q. Kruskal's theorem thus states a significant difference between matrices and 3-way tensors.

3.2. Generic points in parameter space.
Algebraic geometry provides a convenient tool for understanding exceptional sets, like those that fail to satisfy the rank conditions necessary to apply Kruskal's Theorem. We briefly give the needed definitions.
Definition 3.7.
Let S be a finite set of polynomials in C[x_1, . . . , x_n]. The common zero set in C^n of the polynomials in S is the algebraic variety V(S). A subset of a variety that is itself a variety is called a subvariety. For any algebraic variety V(S) ⊆ C^n, the ideal I(V(S)) is the set of all polynomials f ∈ C[x_1, . . . , x_n] such that f(v) = 0 for all v ∈ V(S).

The main result of this work is that PM model parameters are identifiable except for 'rare' choices. This is expressed using the following terminology.
Definition 3.8.
A property is generic on a full-dimensional subset W of R^n or C^n if it holds at all points of W except possibly for those points in some subset U ⊂ W of measure 0. If V is an algebraic variety in C^n, we say a property is generic on V if it holds at all points except those in a proper subvariety of V.

Note that proper subvarieties of varieties always have measure 0, so these notions of generic are consistent with one another.
Example. The set of 3 × 3 matrices is the variety V(S) with S = {0}. The property of having rank, or equivalently Kruskal rank, 3 is generic on V, since matrices of rank at most 2, including those of the form (1), lie in a finite union of lower-dimensional sets. This subvariety of exceptional matrices is defined by a single polynomial, the 3 × 3 determinant.

Proposition 3.9.
Let Φ : U → C^n be a complex analytic map, with U an open subset of C^ℓ. Let V be a variety in C^n. Suppose f ∈ I(V), and that there exists a point p = Φ(u_0) with f(p) ≠ 0. Then for generic points u ∈ U, or u ∈ U ∩ R^ℓ, the point Φ(u) lies off of V.

Proof. This follows from basic properties of complex analytic functions of many variables (see, for instance, the text by Range [1986]). The function f ◦ Φ is analytic, and not identically zero. Its zero set is therefore of measure zero, so for generic u ∈ U, Φ(u) lies off V(f) ⊇ V. The real points in the zero set must similarly have measure zero. □

3.3. Rank Propositions.
For the proof of our main theorem, the ranks and Kruskal ranks of some special matrices arising in the PM model are needed, and we compile these rank computations here. By giving these algebraic results in advance, the proof of Theorem 5.7 can be presented more cleanly. Note that our arguments depend in part on some computations that were performed with the software Pari/GP. As these computations were performed using exact integer arithmetic, they may be taken as valid proofs, up to the usual assumptions of correct programming and no hardware faults.

We begin by defining a particular structured matrix that can arise from particular parameter choices for the PM model.
Definition 3.10.
With a_i ∈ C for i ∈ [κ], and s = a_1 + · · · + a_κ, let M(a_1, . . . , a_κ) denote the κ × κ matrix

(3)  M(a_1, . . . , a_κ) =
     ( a_1 + 1 − s   a_2           · · ·   a_κ
       a_1           a_2 + 1 − s   · · ·   a_κ
       ...                         . . .   ...
       a_1           a_2           · · ·   a_κ + 1 − s ).

Proposition 3.11.
For κ = 20 and m ≤ 77, let M be an mκ × κ matrix formed by stacking m ≥ 1 choices of matrices of the form M(a_1, . . . , a_κ). Then M^{⊗ℓ}_r has full row rank for generic choices of the a_i when ℓ ≥ 3.

Proof. We begin with the special case of ℓ = 3. An exact Pari/GP calculation shows that for m = 77, by picking distinct random integers for a_1, . . . , a_κ for each of the m blocks in M, we may find a point p = M_0 for which M_0^{⊗3}_r has full row rank. By removing some of the blocks from this example if m < 77, we obtain a point p for which M^{⊗3}_r has full row rank for smaller m as well.

To show that full row rank is a generic condition when ℓ = 3, fix m ≤ 77, and observe that the map from the space C^{mκ} of the a_i to M is analytic. Since p = M_0 gives M_0^{⊗3}_r full row rank, there is some mκ × mκ minor f of M^{⊗3}_r which, when viewed as a polynomial in the entries of M, has f(p) ≠ 0. Taking V = V(f), Proposition 3.9 shows that generic choices of the a_i give f(M) ≠ 0, so M^{⊗3}_r has rank mκ.

Now consider ℓ > 3. Then M^{⊗ℓ}_r = M^{⊗3}_r ⊗_r M^{⊗(ℓ−3)}_r, where M^{⊗3}_r = (µ_ij) is an mκ × κ^3 matrix and M^{⊗(ℓ−3)}_r = (α_kl) is an mκ × κ^{ℓ−3} matrix. Since M^{⊗3}_r has full row rank mκ for generic M, its rows are independent. But, with v = mκ,

M^{⊗3}_r ⊗_r M^{⊗(ℓ−3)}_r =
( µ_{11} α_{11}   µ_{12} α_{11}   · · ·   µ_{1κ^3} α_{11}   · · ·
  µ_{21} α_{21}   µ_{22} α_{21}   · · ·   µ_{2κ^3} α_{21}   · · ·
  ...                             . . .   ...
  µ_{v1} α_{v1}   µ_{v2} α_{v1}   · · ·   µ_{vκ^3} α_{v1}   · · · ),

so it is enough to know that the entries of some single column of M^{⊗(ℓ−3)}_r are nonzero, and that M^{⊗3}_r has independent rows, to ensure M^{⊗ℓ}_r has independent rows. But this is true for generic choices of parameters for M. □

The next proposition gives a lower bound on Kruskal row rank, valid for all M^{⊗ℓ}_r.

Proposition 3.12.
For κ ≥ 2, let M be an mκ × κ matrix formed by stacking m ≥ 1 choices of matrices of the form M(a_1, . . . , a_κ). For ℓ ≥ 1, M^{⊗ℓ}_r has Kruskal row rank greater than or equal to 2 for generic choices of the a_i.

Proof. Consider first the case that ℓ = 1. The matrices of Kruskal rank at most 1 form an algebraic variety V. By Proposition 3.9, it is enough to find a single matrix M not in V to see that generically such matrices have Kruskal rank at least two. Choose mκ distinct positive small numbers as the free entries a_1, . . . , a_κ in each block of M, so that the diagonal entries are the largest in the block. Then no two rows within any block M(a_1, . . . , a_κ) are multiples of each other, and no two rows of different blocks are multiples either, since the a_i's are distinct. Thus M has Kruskal rank greater than or equal to two. The case when ℓ > 1 is similar. □

The final propositions in this section involve generic ranks of stacked matrices formed by taking certain tensor products of matrices of the form above.
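Before stating them, the flavor of these rank computations can be seen at a much smaller size than the exact κ = 20 Pari/GP runs. The sketch below (an illustration only, with κ = 4, m = 2, and random real entries in place of random integers) stacks two blocks of the form M(a_1, . . . , a_κ), obtaining a matrix of rank only κ whose third row-tensor power attains full row rank mκ:

```python
import numpy as np

def M_block(a):
    # the structured matrix M(a_1, ..., a_kappa) of Definition 3.10:
    # (j, k) entry a_k, plus an extra 1 - s on the diagonal, s = sum(a)
    a = np.asarray(a, dtype=float)
    return np.tile(a, (a.size, 1)) + (1 - a.sum()) * np.eye(a.size)

def row_tensor_power(M, ell):
    # ell-th row tensor power: row i becomes the ell-fold Kronecker power of row i
    R = M
    for _ in range(ell - 1):
        R = np.einsum('ij,ik->ijk', R, M).reshape(M.shape[0], -1)
    return R

rng = np.random.default_rng(1)
kappa, m = 4, 2
M = np.vstack([M_block(rng.uniform(0.01, 0.2, kappa)) for _ in range(m)])

print(np.linalg.matrix_rank(M))                       # 4 = kappa
print(np.linalg.matrix_rank(row_tensor_power(M, 3)))  # 8 = m*kappa: full row rank
```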
Proposition 3.13.
Let M be an mκ^2 × κ^3 matrix formed by stacking m choices of matrices of the form M(a_1, · · · , a_κ)^{⊗2}_r ⊗ M(a_1, · · · , a_κ). Then for κ = 20 and m < 77, the matrix M has rank greater than mκ for generic choices of the a_i.

Proof. A Pari/GP calculation shows that for some choice of random integers a_i, M has

(1) full row rank 400 > mκ = 20, when m = 1;
(2) full row rank 800 > mκ = 40, when m = 2;
(3) rank 1180 > mκ = 60, when m = 3; and
(4) rank 1540 > mκ = 80, when m = 4.

Furthermore, by (4), for m ≥ 5 there exists a matrix M with rank at least 1540 = 20 × 77 for some choice of the a_i's, since we may repeat some blocks. Using Proposition 3.9, the stated rank condition on M is thus generic for all m < 77. □

Proposition 3.14.
Let M_1 be of the form of M in Proposition 3.13, and let M_2 be formed by stacking m matrices of the form M(a_1, · · · , a_κ) ⊗ M(a_1, · · · , a_κ)^{⊗2}_r. Let L be an mκ^2 × mκ^2 diagonal matrix with positive entries. Then for κ = 20 and m < 74, M_2^T L M_1 has rank greater than mκ for generic choices of the a_i.

Proof. Sylvester's rank inequality gives

rank(M_2^T L M_1) ≥ rank(M_2^T) + rank(L M_1) − mκ^2.

Since M_1 and M_2 differ only by row and column permutations, they have the same rank. Moreover, rank(L M_1) = rank(M_1) since L is a diagonal matrix with positive entries. Then, by Proposition 3.13, there is a choice of a_i's so that M_2^T L M_1 has rank at least

(1) 400 + 400 − 400 = 400 > mκ = 20, when m = 1;
(2) 800 + 800 − 800 = 800 > mκ = 40, when m = 2;
(3) 1180 + 1180 − 1200 = 1160 > mκ = 60, when m = 3; and
(4) 1540 + 1540 − 1600 = 1480 > mκ = 80, when m = 4.

The rank computation for m = 4 shows additionally that there exist choices of a_i giving rank(M_2^T L M_1) ≥ 1480 for larger m, since blocks can be repeated. But 1480 = 20 × 74, so by Proposition 3.9, generically the rank must be greater than mκ for all m < 74. □

4. Algebraic Aspects of the Profile Mixture Model
Next we relate the algebraic definitions made in the previous section to phylogenetic models, and the PM model in particular. We begin by describing how a row tensor product of Markov matrices relates to parameters on a star tree.
Definition 4.1.
Let A be a set of taxa on a star tree rooted at its internal node, with pendant edges e_1, . . . , e_{|A|} and associated Markov matrices M_{e_i}. Then

(4)  M_A = M_{e_1} ⊗_r · · · ⊗_r M_{e_{|A|}}.

For an m-class PM model on a star tree, the matrix M_A is of size mκ × κ^{|A|}. Its entries are conditional probabilities of observing different |A|-tuples of states at the taxa in the set A, given the state at the root.

Given a tree T on taxa X, tripartitions and splits of X can be associated to the topological structure of T. For instance, the tree of Figure 1 displays a tripartition A|B|C with A = {a, b, c}, B = {d, f}, C = {g, h}. Formally, a tripartition A|B|C is displayed on a tree if there is some vertex v of T whose deletion results in three subtrees with A, B, C labeling their leaves. Similarly, if A′ = {a, b, c} and B′ = {d, f, g, h}, then X = A′ ⊔ B′, and T displays the split A′|B′ of X, since there is an edge e whose deletion results in two subtrees with leaves labeled by A′ and B′.

Figure 1. A tree displaying the tripartition A|B|C and the split A|B ∪ C, where A = {a, b, c}, B = {d, f}, C = {g, h}.

When a tree T displays a tripartition of a set of taxa, then the flattening of a joint distribution corresponding to that tripartition can be expressed using the 3-way matrix product of certain matrices built from model parameters.

Lemma 4.2.
Suppose T is a tree on a set of taxa X, rooted at an internal vertex v, and that T displays the tripartition A|B|C associated to v. Let P be a probability distribution for a Markov model M on T with ℓ states at the internal nodes. Then there exist matrices M̃_A, M̃_B, M̃_C constructed from model parameters for M, each with ℓ rows, such that Flat_{A|B|C}(P) = [M̃_A, M̃_B, M̃_C].

Proof. From the parameters on T we may define Markov matrices M_A, M_B, M_C whose entries are conditional probabilities of states at the leaves in each set A, B, C, given the state at v. Let π be the state distribution at v. Then

Flat_{A|B|C}(P) = [π; M_A, M_B, M_C] = [M̃_A, M̃_B, M̃_C],

where M̃_A = diag(π) M_A, M̃_B = M_B, and M̃_C = M_C. □

For establishing generic properties of the PM model, we will often consider the particular choice of exchangeabilities given by the matrix R = 𝟙 whose entries are all 1. This is in essence the CAT-F81 model [Lartillot and Philippe, 2004, Le et al., 2008], with the number of profiles some fixed m. For this R, a Markov matrix has the form given in equation (3) of Definition 3.10.

Lemma 4.3.
Consider the PM model
PM(T, κ, m) with R = 𝟙, and let e be a branch of T of length 1. Then for a single class c with profile π and rate r ≥ 0, the Markov matrix M_{e,c} = exp(Q_c r) for e is of the form M(a_1, . . . , a_κ) of Definition 3.10, with a_i = π_i (1 − e^{−r}) ≥ 0 and s = Σ_{i=1}^κ a_i satisfying 0 ≤ s < 1.

Conversely, any κ × κ Markov matrix of the form M = M(a_1, . . . , a_κ) with a_j ≥ 0 and 0 ≤ s < 1 comes from a choice of parameters for one class of the PM model with R = 𝟙 on an edge of length 1. Provided s ≠ 0 (equivalently, r ≠ 0), this correspondence is one-to-one.

Proof. The first statement follows by direct computation: With e_j the standard basis vectors, Q_c = R diag(π) − I has right eigenvectors −π_j e_1 + π_1 e_j with eigenvalue −1 for 2 ≤ j ≤ κ, and eigenvector Σ_{j=1}^κ e_j with eigenvalue 0.

For the converse, since 0 ≤ s < 1, there is a unique r ≥ 0 with s = 1 − e^{−r}. If s > 0, let π_j = a_j / s for j = 1, · · · , κ, and π = (π_j). Then Σ_{j=1}^κ π_j = 1, and a_j = π_j (1 − e^{−r}). With these choices Q = R diag(π) − I, and M = exp(r Q). If s = 0, then all the a_j are zero, and M is the identity matrix. Take r = 0 and π arbitrary. Then M = exp(0 · Q). □

5. Identifiability of Parameters for the Profile Mixture Model
With preliminaries completed, we now turn to establishing our main result on generic parameter identifiability for the PM model. The first step is to understand that the rank of a matrix flattening of a model distribution is affected by whether the associated split is, or is not, displayed on the tree T.

Proposition 5.1.
Let T be an n-taxon tree on X and P a distribution from the model PM = PM(T, κ, m) with κ = 20 and m < 74. Suppose that A | B is a split of X with |A|, |B| ≥ 3.

(1) If A | B is displayed on T, then Flat_{A|B}(P) has rank at most mκ;
(2) If A | B is not displayed on T, then Flat_{A|B}(P) generically has rank greater than mκ.

Before beginning the proof, we present a simplified example to illustrate how the matrix rank of flattenings of joint distributions from Markov models on trees carries information about the absence or presence of an internal edge of T.

Example. Consider a single-class 2-state Markov model on the 4-taxon tree shown in Figure 2. A special case of this model is PM(T, 2, 1). The joint distribution at the leaves is the 2 × 2 × 2 × 2 tensor P, with entries p_{ijkl} indexed by states at the leaves in the order a, b, c, d.

Figure 2. A 4-taxon tree with split {a, b} | {c, d}.

With A = {a, b} and B = {c, d}, the rows and columns of Flat_{A|B}(P) are indexed by elements of [2] × [2]: the ((i, j), (k, l)) entry is p_{ijkl}. In contrast, if A′ = {a, c} and B′ = {b, d}, the flattening Flat_{A′|B′}(P) has ((i, k), (j, l)) entry p_{ijkl}.

Now suppose that the terminal edges of T have length 0, so that the states at a and b must agree, as must those at c and d, since no substitutions occur on terminal edges. Then the matrix Flat_{A|B}(P) arises from the joint distribution of states at the internal nodes v_1 and v_2, and its only non-zero entries are the p_{iijj}. Thus the matrix flattening for the split A | B displayed by T has form

Flat_{A|B}(P) =
              (1,1)     (1,2)   (2,1)     (2,2)
    (1,1) [ p_{1111}      0       0     p_{1122} ]
    (1,2) [     0         0       0         0    ]
    (2,1) [     0         0       0         0    ]
    (2,2) [ p_{2211}      0       0     p_{2222} ]

with rank at most 2 = mκ. In contrast, the flattening for the split A′ | B′ not displayed on T has form

Flat_{A′|B′}(P) =
              (1,1)     (1,2)     (2,1)     (2,2)
    (1,1) [ p_{1111}      0         0         0    ]
    (1,2) [     0     p_{1122}      0         0    ]
    (2,1) [     0         0     p_{2211}      0    ]
    (2,2) [     0         0         0     p_{2222} ]

which generically has rank 4 = mκ^2 > mκ.

If the terminal edges of T are of positive length, then the resulting joint distribution P can be obtained by a simple and generically rank-preserving linear action on the rows and columns of the flattenings above. Thus, flattenings respecting the topology of T generically have rank mκ while those that do not generically have larger rank.

Proof of Proposition 5.1.
To show claim (1), suppose the split A | B is displayed on T, with associated edge e = (v_A, v_B). Let M_A be the mκ × κ^{|A|} matrix and M_B the mκ × κ^{|B|} matrix giving the conditional probabilities of jointly observing states at A and B, conditioned on states at v_A and v_B respectively. Then, rooting the tree at v_A and letting M_e denote the mκ × mκ Markov matrix associated to e, the joint distribution of (v_A, v_B) is diag(Π) M_e, and it follows that

Flat_{A|B}(P) = M_A^T diag(Π) M_e M_B.

Since rank(M_e) ≤ mκ, it follows that Flat_{A|B}(P) has rank at most mκ.

For claim (2), suppose now A | B is not displayed on T. Let V be the variety of matrices of size κ^{|A|} × κ^{|B|} with rank at most mκ, defined by the set of all (mκ + 1) × (mκ + 1) minors. By Proposition 3.9, it suffices to find a single choice of PM(T, κ, m) parameters that produces a point off V, as the parameterization extends to a complex analytic function.

Since T does not display A | B, by Theorem 3.8.6 of Semple and Steel [2003] there is an edge e = (v_1, v_2) of T with associated split C | D such that A′ = A ∩ C, A′′ = A ∩ D, B′ = B ∩ C, B′′ = B ∩ D are all non-empty. To find the needed choice of parameters, fix all internal edges of T except e to have length 0, so the Markov matrices on these edges are I, and fix the edge lengths of all terminal edges and e to be 1. See Figure 3. Take R = 1 and the mixing weights w_i = 1/m to be uniform. Values for the parameters π_i, r_i will be specified later in the argument. For this choice of parameters, T is formed by joining two star trees at the ends of e.

Figure 3. A tree T which does not display the split A | B, but displays the split C | D such that A′ = A ∩ C, A′′ = A ∩ D, B′ = B ∩ C, B′′ = B ∩ D are all non-empty.

Taking r = v_1 to be the root of T, let K = diag(Π) M_e be the mκ × mκ block diagonal matrix which is the joint distribution of classes and states at v_1 and v_2. The probabilities of observing states i, j, k, l at the leaves in A′, B′, A′′, B′′ respectively, P(i, j, k, l), are the entries of a κ^{|A′|} × κ^{|B′|} × κ^{|A′′|} × κ^{|B′′|} tensor.

Define an mκ × mκ × mκ × mκ tensor Q by

Q(i, j, k, l) = K(i, k) if i = j and k = l, and Q(i, j, k, l) = 0 otherwise.

The tensor Q is the joint distribution of states at the leaves of the tree T of Figure 3 when terminal edges have length zero and A′, B′, A′′, B′′ are single taxa. Indeed, since A | B is not displayed on T, the matrix Q̂ = Flat_{A|B}(Q) is (mκ)^2 × (mκ)^2 with entries

Q̂((i, j), (k, l)) = Q(i, k, j, l).

Since K is block diagonal, Q̂ has at most mκ^2 nonzero entries, all appearing on the diagonal, and Q̂ is generically of rank mκ^2.

To see that in the general case Flat_{A|B}(P) has a similar structure, let N_A = M_{A′} ⊗ M_{A′′} and N_B = M_{B′} ⊗ M_{B′′}, where M_{A′}, M_{A′′}, M_{B′}, M_{B′′} are given as in equation (4) of Definition 4.1. Then

(5)  Flat_{A|B}(P) = N_A^T Q̂ N_B.

Figure 4.
Trees with (a) |A′| = |B′| = 2 and |A′′| = |B′′| = 1, and (b) |A′| = |B′′| = 2 and |A′′| = |B′| = 1.

We now establish that claim (2) holds when |A| = |B| = 3, so the tree is one of those shown in Figure 4. Suppose first that |A′| = |B′| = 2 and |A′′| = |B′′| = 1, as shown for tree (a) of the figure. In this case N_A = N_B. Since Q̂ is diagonal with at most mκ^2 non-zero entries due to the block structure of K, in equation (5) we can replace Q̂ by a diagonal mκ^2 × mκ^2 matrix Q̄ by eliminating zero rows and columns. To do this, we must also replace N_A = N_B with an mκ^2 × κ^3 matrix N formed by taking tensor products of the individual class components of M_{A′} and M_{A′′} and then restacking. To be concrete, for class c the Markov matrix for a terminal edge is M_c = M(a_{c1}, ..., a_{cκ}) by Lemma 4.3, and N is formed by stacking the m matrices (M_c)^{⊗_r 2} ⊗ M_c, where ⊗_r denotes the row tensor power.

Since Q̄ is diagonal with generically positive entries, using equation (5) we have that

Flat_{A|B}(P) = (N^T Q̄^{1/2})(Q̄^{1/2} N) = Λ^T Λ, where Λ = Q̄^{1/2} N.

By the singular value decomposition, it follows that

rank(Λ^T Λ) = rank(Λ) = rank(N).

The Pari/GP calculation presented in Proposition 3.13, together with Proposition 3.9, shows that rank(N) > mκ generically, and thus for generic π_i and r_i it follows that rank(Flat_{A|B}(P)) > mκ.

Continuing with |A| = |B| = 3, suppose now that |A′| = |B′′| = 2 and |A′′| = |B′| = 1, as shown in Figure 4(b). The previous argument fails for this tree because now N_A ≠ N_B, as the tensor products defining these matrices are taken in different orders. However, a more complicated Pari/GP calculation, presented as Proposition 3.14, shows that Flat_{A|B}(P) generically has rank greater than mκ in this case.

Finally, for the general case of |A|, |B| ≥ 3, take Â to be a 3-element subset of A with at least one element from A′ and one from A′′, and similarly take B̂ to be a 3-element subset of B with at least one element from B′ and one from B′′. Let P̂ be the probability distribution for the taxa Â ∪ B̂. Since the row indices of Flat_{A|B}(P) depend on the states at the taxa in A and the column indices depend on the states at the taxa in B, marginalizing over all possible states for the taxa in A which are not in Â, and similarly for B, gives the matrix Flat_{Â|B̂}(P̂). There exist matrices J_1, J_2 which perform this marginalization on Flat_{A|B}(P):

J_1 Flat_{A|B}(P) J_2 = Flat_{Â|B̂}(P̂).

Since Flat_{Â|B̂}(P̂) generically has rank greater than mκ, and by this equation the rank of Flat_{A|B}(P) is at least that of Flat_{Â|B̂}(P̂), it follows that Flat_{A|B}(P) generically has rank greater than mκ. □

As a consequence of Proposition 5.1, from a distribution P computed from generic PM model parameters we can identify every edge in the tree for which there are at least three taxa on either side, by computing ranks of flattenings of P. In the following, we see that Proposition 5.1 also helps to identify at least one tripartition on the tree.

Proposition 5.2.
Let T be an n-taxon tree on X with n ≥ 9, and P a joint distribution from generic parameters for the model PM(T, κ, m) with κ = 20 and m < 74. Then there is at least one tripartition A | B | C displayed on T, with |A|, |B| ≥ 3, which can be identified from P.

Proof. By a lemma from Section 4, any tree T with n ≥ 9 taxa has an internal vertex v which induces a tripartition A | B | C such that two of the three components contain at least ⌈n/3⌉ leaves of T. The two edges incident to v that correspond to subsets of X with at least ⌈n/3⌉ leaves are generically identifiable by Proposition 5.1, since for n ≥ 9, ⌈n/3⌉ ≥ 3. If the third edge incident to v has 3 or more taxa in its component, it also can be identified. Thus, it remains to establish that the third edge incident to v can be identified when the number of taxa in its component is 1 or 2. Examples of such trees are illustrated for n = 9 in Figure 5.

Figure 5. Examples of 9-taxon trees with internal vertex v inducing A | B | C with |A|, |B| ≥ 3 and |C| = 1 or 2.

If the third component has only one leaf, as in Figure 5(a), the two bipartitions A ∪ {c} | B and A | B ∪ {c} are identifiable by Proposition 5.1. Together these imply that the tripartition induced by v is A | B | {c}. If the third component has two leaves, as in Figure 5(b), the two splits A ∪ {c_1, c_2} | B and A | B ∪ {c_1, c_2} are identifiable, while A ∪ {c_1} | B ∪ {c_2} and A ∪ {c_2} | B ∪ {c_1} are not displayed on T, and this can be detected by Proposition 5.1. This implies the tripartition A | B | {c_1, c_2} is on the tree. □

With a tripartition on the tree identifiable by the preceding proposition, we prepare to apply Kruskal's Theorem. Letting P be a joint distribution from PM(T, κ, m), pick an internal vertex v of T inducing such a tripartition A | B | C. Then by Lemma 4.2,

Flat_{A|B|C}(P) = [π; M_A, M_B, M_C] = [M̃_A, M_B, M_C],

where M̃_A = diag(Π) M_A. Provided the Kruskal ranks of the matrices M_A, M_B, M_C are large enough, at least generically, Kruskal's theorem can be applied. The next three lemmas establish this.

Lemma 5.3.
Consider the model PM(T, 20, m) with m < 74. If ℓ ≥ 3, then the ℓth row tensor power of the mκ × κ Markov matrix associated to a terminal edge of T has full row rank for generic parameters.

Proof. Using Proposition 3.9, it is enough to show there is a single choice of parameters for which the tensor power has full row rank. Let R = 1, and take the terminal branch lengths to be 1. Then by Lemma 4.3 the Markov matrix M_e on a terminal edge has the form of stacked matrices of the form M(a_1, ..., a_κ). By the Pari/GP calculation of Proposition 3.11, for generic choices of the other parameters, the row tensor power M_e^{⊗_r ℓ}, ℓ ≥ 3, has full row rank. □
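The construction invoked in this proof can be replayed numerically on a small scale: build the stacked per-class matrices M(a_1, ..., a_κ) of Lemma 4.3 via a matrix exponential, and check that a row tensor power has full row rank. A minimal sketch, assuming NumPy and with κ and m far smaller than the κ = 20, m < 74 of the results (the homemade Taylor-series exponential is an illustration, not the paper's Pari/GP computation):

```python
import numpy as np

def expm_taylor(A, terms=40):
    """Matrix exponential via truncated Taylor series (adequate for small ||A||)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

rng = np.random.default_rng(0)
m, kappa, ell = 3, 4, 3            # small stand-ins for m < 74, kappa = 20

# Stack the per-class matrices of Lemma 4.3: for class c,
# Q_c = R diag(pi_c) - I with R the all-ones matrix, and M_c = exp(r_c Q_c).
blocks = []
for _ in range(m):
    pi_c = rng.dirichlet(np.ones(kappa))
    r_c = rng.uniform(0.2, 2.0)
    Q_c = np.outer(np.ones(kappa), pi_c) - np.eye(kappa)
    M_c = expm_taylor(r_c * Q_c)
    # Sanity check of Lemma 4.3: column j of M_c is a_j = pi_j(1 - e^{-r})
    # off the diagonal, with e^{-r} added on the diagonal.
    a = pi_c * (1.0 - np.exp(-r_c))
    assert np.allclose(M_c, np.exp(-r_c) * np.eye(kappa) + np.outer(np.ones(kappa), a))
    blocks.append(M_c)
Me = np.vstack(blocks)             # the m*kappa x kappa terminal-edge Markov matrix

# ell-th row tensor power: row i of N is the ell-fold tensor product of row i of Me
N = Me
for _ in range(ell - 1):
    N = np.einsum('ij,ik->ijk', N, Me).reshape(m * kappa, -1)

assert N.shape == (m * kappa, kappa ** ell)
assert np.linalg.matrix_rank(N) == m * kappa   # full row rank, as in Lemma 5.3
```

The full row rank of N is exactly the generic condition the lemma asserts; for the actual model sizes the verification requires the exact-arithmetic computations cited above.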
Using Proposition 3.12 in a similar argument we obtain the following.
Lemma 5.4.
Consider the model PM(T, κ, m) with κ ≥ 2 and m ≥ 1. Then for ℓ ≥ 1, the ℓth row tensor power of the mκ × κ Markov matrix associated to a terminal edge of T generically has Kruskal rank at least 2.

Lemma 5.5.
For a distribution from the model PM(T, κ, m) with κ = 20 and m < 74, let M_A, M_B, M_C be the matrices described above. If |A|, |B| ≥ 3 and |C| ≥ 1, then generically M_A and M_B have full Kruskal rank and M_C has Kruskal rank at least 2.

Proof. Using Proposition 3.9, we need only show there is a single choice of parameters for which these rank claims hold. Set all internal branch lengths to 0 and all terminal branch lengths to 1, so that T is a star tree rooted at the central node v. Then by Lemma 5.3, since |A|, |B| ≥ 3, for generic parameters the matrices M_A (and therefore M̃_A) and M_B have full row rank, and therefore full Kruskal rank. Also, by Lemma 5.4, M_C has Kruskal rank at least 2. □

We add the last ingredient before the main result.
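Before doing so, note that Kruskal rank, unlike ordinary rank, has no simple linear-algebra shortcut; on small examples it can be computed by brute force over row subsets. A minimal sketch assuming NumPy, with a generic random stochastic matrix standing in for the structured matrices of Lemma 5.5:

```python
from itertools import combinations

import numpy as np

def kruskal_rank(M, tol=1e-10):
    """Largest k such that every set of k rows of M is linearly independent."""
    n = M.shape[0]
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            if np.linalg.matrix_rank(M[list(rows)], tol=tol) < k:
                return k - 1
    return n

rng = np.random.default_rng(7)
M = rng.dirichlet(np.ones(5), size=4)   # generic 4 x 5 stochastic matrix
assert kruskal_rank(M) == 4             # generic rows: full Kruskal rank

# A repeated row drops the Kruskal rank to 1, though ordinary rank stays 4
M2 = np.vstack([M, M[0]])
assert kruskal_rank(M2) == 1
```

The exhaustive subset search is exponential in the number of rows, which is why the paper establishes the needed Kruskal rank bounds theoretically (via full row rank, which implies full Kruskal rank) rather than computationally.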
Proposition 5.6.
Suppose T is a tree on X which displays a known tripartition A | B | C corresponding to a vertex r, with |A|, |B| ≥ 3 and |C| ≥ 1. If κ = 20 and m < 74, then both T and the numerical parameters of the PM(T, κ, m) model are generically identifiable, up to arbitrary rescaling of the tree and the exchangeability matrix R.

Proof. Using the notation and result of Lemma 5.5, if a distribution P comes from generic parameters of PM(T, κ, m), then

Flat_{A|B|C}(P) = [M̃_A, M_B, M_C],

where M̃_A, M_B have full Kruskal rank and M_C has Kruskal rank at least 2. Thus equation (2) of Theorem 3.6 is satisfied with l = mκ, and M̃_A, M_B, M_C are determined uniquely up to simultaneous permutation and scaling of the rows.

Also, by factoring out row sums from the matrices, we can generically identify the root distribution vector Π at the node r, and M_A, M_B, M_C up to simultaneous permutation of the entries of Π and the rows of the matrices. Considering any entry of Π, and supposing that it corresponds to an unknown class u ∈ [m] and state w ∈ [κ], the same rows of M_A, M_B, M_C correspond to the same class u and state w. Since Kruskal's theorem yields identifiability only up to permutation, we must determine which of the mκ rows of M_A, M_B, M_C correspond to the same fixed class u.

Consider first the special case that |A| = 3, with A = {a, b, c}. Then T, which is generically binary, has a subtree rooted at r with leaves A = {x, y, z}, as shown in Figure 6, though we do not know which two taxa from a, b, c form the cherry {y, z}.

Figure 6. A subtree of T with leaves A = {a, b, c} = {x, y, z}.

The Markov matrix M_A is of size mκ × κ^3. Choose the ℓth row of M_A, where ℓ = (u, w) for unknown u, w. It is a row vector with κ^3 entries, but we can reconfigure it as a 3-dimensional tensor of size κ × κ × κ so that its (i, j, k)-entry is P(a = i, b = j, c = k | r = ℓ). Since the PM model is time reversible, take v_1 as the root of the subtree in Figure 6. Then for an unknown 1 × κ vector π_v, and κ × κ Markov matrices M_x, M_y, M_z, M_1, M_2 for class u on this subtree, the joint distribution of states at x, y, z, r for the fixed class u is

P(x = i, y = j, z = k, r = (u, w))
  = Σ_{α=1}^κ Σ_{β=1}^κ π_v(β) M_y(β, j) M_z(β, k) M_1(β, α) M_2(α, w) M_x(α, i)
  = Σ_{β=1}^κ π_v(β) M_y(β, j) M_z(β, k) ( Σ_{α=1}^κ M_1(β, α) M_2(α, w) M_x(α, i) )
  = Σ_{β=1}^κ π_v(β) M_y(β, j) M_z(β, k) M̂_{(u,w)}(β, i)
  = [π_v; M_y, M_z, M̂_{(u,w)}],

where M̂_{(u,w)} = M_1 diag(M_2(·, w)) M_x, with M_2(·, w) denoting the wth column of M_2. For fixed u this is simply a rescaling of the conditional distribution P(x = i, y = j, z = k | r = (u, w)) given in the ℓth row of M_A.

Thus, applying Kruskal's theorem to each row of M_A reshaped into such a 3-way tensor, we can decompose P(x = i, y = j, z = k | r = ℓ) for each ℓ = (u, w) into a triple product, as the matrices generically all have rank κ. Note that for each ℓ = (u, w), Kruskal's theorem gives the matrices M_y, M_z, M̂_{(u,w)} up to ordering of their κ rows. Two of these matrices, M_y and M_z, depend only on the class u, and not on the state w. So considering all ℓ = (u, w), we can find κ rows of M_A with the same (possibly row-permuted) versions of M_y and M_z, and these correspond to a single class u. In this way we can group the rows of M_A, M_B, M_C, with entries of Π, by class u.
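The collapsing step above, replacing the inner sum over α by the single matrix M̂_{(u,w)} = M_1 diag(M_2(·, w)) M_x, can be verified numerically. A sketch assuming NumPy, with random stochastic matrices standing in for actual model parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
k = 4                                    # small stand-in for kappa = 20
pi_v = rng.dirichlet(np.ones(k))         # distribution at the subtree root v_1
Mx, My, Mz, M1, M2 = (rng.dirichlet(np.ones(k), size=k) for _ in range(5))
w = 2                                    # a fixed state at the vertex r

# Direct double sum over internal states beta (at v_1) and alpha (at v_2)
T_direct = np.einsum('b,bj,bk,ba,a,ai->ijk',
                     pi_v, My, Mz, M1, M2[:, w], Mx)

# Collapsed triple-product form [pi_v; My, Mz, Mhat]
Mhat = M1 @ np.diag(M2[:, w]) @ Mx
T_collapsed = np.einsum('b,bj,bk,bi->ijk', pi_v, My, Mz, Mhat)

assert np.allclose(T_direct, T_collapsed)
```

The collapsed form is exactly the shape of tensor to which Kruskal's theorem applies, with the three factor matrices π_v-weighted M_y, M_z, and M̂_{(u,w)}.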
Now taking those rows of M_A, M_B, M_C and entries of Π for one class u, and reassembling them into a 3-way product, gives a tensor for a single-class GTR model on the full tree T. Both the tree T and the numerical parameters are identifiable for this single-class model by Theorem 2.4.

For the general case, suppose |A|, |B| ≥ 3. Then by marginalization down to |A| = 3 we can identify the subtrees and parameters for B, C. Then interchanging the roles of A and B identifies the subtree and parameters for A. □

Combining Proposition 5.2 with Proposition 5.6, we have proved the main result.
Theorem 5.7.
Let T be a tree with at least 9 taxa. Then under the PM(T, 20, m) model with m < 74, both T and the numerical parameters are generically identifiable, up to arbitrary rescaling of the tree and the exchangeability matrix R.

Theorem 5.7 extends to certain tree shapes with fewer than 9 taxa. To apply Proposition 5.6, T must display a tripartition with two of its subsets of size at least 3, so T must have at least 7 taxa. Such a tripartition will be generically identifiable by the argument given for Proposition 5.2.

Corollary 5.8.
For the profile mixture model PM(T, 20, m) with m < 74, parameters are generically identifiable if T has any of the 8-taxon tree shapes (a)-(d) shown in Figure 7, or the 7-taxon caterpillar shape.

Figure 7. All binary unrooted tree shapes for 8 taxa. Parameters of the PM model are generically identifiable for trees (a)-(d). The arguments of this paper do not answer the identifiability question for tree (e).
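The rank dichotomy of Proposition 5.1, which drives all of the identifiability arguments above, can be replayed in miniature for the 2-state quartet example of Figure 2. A sketch assuming NumPy, with random parameters standing in for generic ones (here m = 1 and κ = 2, so mκ = 2):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 2                                     # two states, single class

def random_markov(k):
    return rng.dirichlet(np.ones(k), size=k)   # rows sum to 1

pi = rng.dirichlet(np.ones(k))            # root distribution at v_1
Me = random_markov(k)                     # internal edge v_1 -> v_2
Ma, Mb, Mc, Md = (random_markov(k) for _ in range(4))

# Joint distribution p_{ijkl} at leaves a, b, c, d of the quartet ab|cd
P = np.einsum('x,xy,xi,xj,yk,yl->ijkl', pi, Me, Ma, Mb, Mc, Md)

flat_ab_cd = P.reshape(k * k, k * k)                        # split displayed on T
flat_ac_bd = P.transpose(0, 2, 1, 3).reshape(k * k, k * k)  # split not displayed

assert np.linalg.matrix_rank(flat_ab_cd) == 2   # rank m*kappa for the true split
assert np.linalg.matrix_rank(flat_ac_bd) == 4   # generically larger rank
```

Comparing flattening ranks across candidate splits is thus, in principle, enough to recover every edge of the tree with at least three taxa on each side, as in the proofs above.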
Acknowledgments
This research was supported, in part, by the National Institutes of Health Grant R01GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at theInterface of the Biological and Mathematical Sciences.
Author Disclosure Statement
No competing financial interests exist.
References
E.S. Allman and J.A. Rhodes. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J. Comput. Biol., 13:1101–1113, 2006. doi: 10.1007/s00285-010-0355-7.
E.S. Allman and J.A. Rhodes. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci., 211(1):18–33, 2008.
E.S. Allman and J.A. Rhodes. The identifiability of covarion models in phylogenetics. IEEE/ACM Trans. Comput. Biol. Bioinform., 6(1):76–88, 2009.
E.S. Allman, C. Ané, and J.A. Rhodes. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. in Appl. Probab., 40:229–249, 2008.
E.S. Allman, M.T. Holder, and J.A. Rhodes. Estimating trees from filtered data: Identifiability of models for morphological phylogenetics. J. Theor. Biol., 263:108–119, 2010.
E.S. Allman, S. Petrović, J.A. Rhodes, and S. Sullivant. Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans. Comput. Biol. Bioinform., 8(3):710–722, 2011.
E.S. Allman, C. Long, and J.A. Rhodes. Species tree inference from genomic sequences using the log-det distance. SIAM J. Appl. Algebra Geometry, 3(1):1–30, 2019.
J. Chai and E.A. Housworth. On Rogers's proof of identifiability for the GTR + Gamma + I model. Syst. Biol., 60(5):713–718, 2011.
J. Chifman and L. Kubatko. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J. Theor. Biol., 374:35–47, 2015.
B. Hollering and S. Sullivant. Identifiability in phylogenetics using algebraic matroids. arXiv:1909.13754, 2019.
D.T. Jones, W.R. Taylor, and J.M. Thornton. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 8(3):275–282, 1992. doi: 10.1093/bioinformatics/8.3.275.
J.B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95–138, 1977. doi: 10.1016/0024-3795(77)90069-6.
N. Lartillot and H. Philippe. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol., 21:1095–1109, 2004.
N. Lartillot, H. Brinkmann, and H. Philippe. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol. Biol., 7:S4, 2007.
N. Lartillot, T. Lepage, and S. Blanquart. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics, 25:2286–2288, 2009.
N. Lartillot, N. Rodrigue, D. Stubbs, and J. Richer. PhyloBayes MPI: Phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol., 62(4):611–615, 2013.
S.Q. Le, O. Gascuel, and N. Lartillot. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics, 24(20):2317–2323, 2008. doi: 10.1093/bioinformatics/btn445.
C. Long and S. Sullivant. Identifiability of 3-class Jukes-Cantor mixtures. Adv. in Appl. Math., 64:89–110, 2015. doi: 10.1016/j.aam.2014.12.003.
R.M. Range. Holomorphic Functions and Integral Representations in Several Complex Variables, volume 108 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1986.
B. Rannala. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst. Biol., 51(5):754–760, 2002. doi: 10.1080/10635150290102429.
J.A. Rhodes and S. Sullivant. Identifiability of large phylogenetic mixture models. Bull. Math. Biol., 74:212–231, 2012. doi: 10.1007/s11538-011-9672-2.
C. Semple and M. Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford, 2003.
The PARI Group. PARI/GP. Univ. Bordeaux, 2019. URL http://pari.math.u-bordeaux.fr/.
H.-C. Wang, K. Li, E. Susko, and A.J. Roger. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evol. Biol., 8(331):1–13, 2008. doi: 10.1186/1471-2148-8-331.
H.-C. Wang, E. Susko, and A.J. Roger. An amino acid substitution-selection model adjusts residue fitness to improve phylogenetic estimation. Mol. Biol. Evol., 31(4):779–792, 2014.
M. Wascher and L. Kubatko. Consistency of SVDQuartets and maximum likelihood for coalescent-based species tree estimation. Syst. Biol., 2020. In press.
S. Whelan and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol., 18(5):691–699, 2001. doi: 10.1093/oxfordjournals.molbev.a003851.
Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol., 39:306–314, 1994.
Department of Mathematics and Statistics, University of Alaska Fairbanks, 99775
E-mail address : [email protected] Department of Mathematics and Statistics, University of Alaska Fairbanks, 99775
E-mail address : [email protected] Department of Mathematics and Statistics, University of Alaska Fairbanks, 99775
E-mail address :