Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci
aa r X i v : . [ q - b i o . P E ] N ov Incomplete Lineage Sorting: Consistent PhylogenyEstimation From Multiple Loci ∗ Elchanan Mossel
Department of StatisticsUniversity of California, Berkeley [email protected]
Sebastien Roch
Theory GroupMicrosoft Research
May 31, 2018
Abstract
We introduce a simple algorithm for reconstructing phylogenies frommultiple gene trees in the presence of incomplete lineage sorting, that is,when the topology of the gene trees may differ from that of the species tree.We show that our technique is statistically consistent under standard stochas-tic assumptions, that is, it returns the correct tree given sufficiently manyunlinked loci. We also show that it can tolerate moderate estimation errors.
Phylogenies—the evolutionary relationships of a group of species—are typicallyinferred from estimated genealogical histories of one or several genes (or genetrees ) [Fel04, SS03]. Yet it is well known that such gene trees may provide mis-leading information about the phylogeny (or species tree ) containing them. In-deed, it was observed early on that a gene tree may be topologically inconsistentwith its species tree, a phenomenon known as incomplete lineage sorting . Seee.g. [Mad97, Nic01, Fel04] and references therein. Such discordance plays littlerole in the reconstruction of deep phylogenetic branchings but it is critical in thestudy of recently diverged populations [LP02, HM03, Kno04].Two common approaches to deal with this issue are concatenation and majorityvoting . In the former, one concatenates the sequences originating from several ∗ Keywords: incomplete lineage sorting, gene tree, species tree, coalescent, topological concor-dance, statistical consistency. E.M. is supported by an Alfred Sloan fellowship in Mathematics andby NSF grants DMS-0528488, and DMS-0548249 (CAREER) and by ONR grant N0014-07-1-05-06. most likely gene tree may be inconsistent with the species tree; and this situationmay arise on any topology with at least species. See also [PN88, Tak89] forrelated results.Other techniques are being explored that attempt to address incomplete lineagesorting, notably Bayesian [ELP07] and likelihood [SR07] methods. However theproblem is still far from being solved as discussed in [MK06]. Here we propose asimple technique—which we call Global LAteSt Split or GLASS—for estimatingspecies trees from multiple genes (or loci ). Our technique develops some of theideas of Takahata [Tak89] and Rosenberg [Ros02] who studied the properties ofgene trees in terms of the corresponding species tree. In our main result, we showthat GLASS is statistically consistent , that is, it always returns the correct topologygiven sufficiently many (unlinked) genes—thereby avoiding the pitfalls highlightedin [DR06]. We also obtain explicit convergence rates under a standard model basedon Kingman’s coalescent [Kin82]. Moreover, we allow the use of several allelesfrom each population and we show how our technique leads to an extension ofRosenberg’s topological concordance [Ros02] to multiple loci.We note the recent results of Steel and Rodrigo [SR07] who showed that Max-imum Likelihood (ML) is statistically consistent under slightly different assump-tions. An advantage of GLASS over likelihood (and Bayesian) methods is its com-putational efficiency, as no efficient algorithm for finding ML trees is known. Fur-thermore, GLASS gives explicit convergence rates—useful in assessing the qualityof the reconstruction.For more background on phylogenetic inference and coalescent theory, seee.g. [Fel04, SS03, HSW05, Nor01, Tav04]. Organization.
The rest of the paper is organized as follows. We begin in Sec-tion 2 with a description of the basic setup. The GLASS method is introduced inSection 3. A proof of its consistency can be found in Sections 4 and 5. We showin Section 6 that GLASS remains consistent under moderate estimation errors. Fi-nally in Section 7 we do away with the molecular clock assumption and we showhow our technique can be used in conjunction with any distance matrix method.2
Basic Setup
We introduce our basic modelling assumptions. See e.g. [DR06].
Species tree.
Consider n isolated populations with a common evolutionary his-tory given by the species tree S = ( V, E ) with leaf set L . Note that | L | = n . Foreach branch e of S , we denote: • N e , the (haploid) population size on e (we assume that the population sizeremains constant along the branch); • t e , the number of generations encountered on e ; • τ e = t e N e , the length of e in standard coalescent time units; • µ = min e τ e , the shortest branch length in S .The model does not allow migration between contemporaneous populations. Oftenin the literature, the population sizes { N e } e ∈ E , are taken to be equal to a constant N . Our results are valid in a more general setting. Gene trees.
We consider k loci I . For each population l and each locus i , wesample a set of alleles M ( i ) l . Each locus i ∈ I has a genealogical history repre-sented by a gene tree G ( i ) = ( V ( i ) , E ( i ) ) with leaf set L ( i ) = ∪ l M ( i ) l . For twoleaves a, b in G ( i ) , we let D ( i ) ab be the time in number of generations to the most re-cent common ancestor of a and b in G ( i ) . Following [Tak89, Ros02] we are actuallyinterested in interspecific coalescence times. Hence, we define, for all r, s ∈ L , D ( i ) rs = min n D ( i ) ab : a ∈ M ( i ) r , b ∈ M ( i ) s o . Inference problem.
We seek to solve the following inference problem. We aregiven k gene trees as above, including accurate estimates of the coalescence times (cid:26)(cid:16) D ( i ) ab (cid:17) a,b ∈L ( i ) (cid:27) i ∈I . Our goal is to infer the species tree S . 3 tochastic Model. In Section 4, we will first state the correcteness of our infer-ence algorithm in terms of a combinatorial property of the gene trees. In Section 5,we will then show that under the following standard stochastic assumptions, thisproperty holds for a moderate number of genes.Namely, we will assume that each gene tree G ( i ) is distributed according to astandard coalescent process : looking backwards in time, in each branch any twoalleles coalesce at exponential rate 1 independently of all other pairs; whenevertwo populations merge in the species tree, we also merge the allele sets of thecorresponding populations (that is, the coalescence proceeds on the union of bothallele sets). We further assume that the k loci I are unlinked or in other words thatthe gene trees {G ( i ) } i ∈I are mutually independent.Under these assumptions, an inference algorithm is said to be statistically con-sistent if the probability of returning an incorrect reconstruction goes to 0 as k tends to + ∞ . We introduce a technique which we call the Global LAteSt Split (GLASS) method.
Inference method.
Consider first the case of a single gene ( k = 1 ). Lookingbackwards in time, the first speciation occurs at some time T , say between popu-lations r and s . It is well known that, for any sample a from M (1) r and b from M (1) s , the coalescence time D (1) ab between alleles a and b overestimates the diver-gence time of the populations. As noted in [Tak89], a better estimate of T can beobtained by taking the smallest interspecific coalescence time between alleles in M (1) r and in M (1) s , that is, by considering instead D (1) r s .The inference then proceeds as follows. First, cluster the two populations, say r and s , with smallest interspecific coalescence time D (1) r s . Define the coales-cence time of two clusters A, B ⊆ L as the minimum interspecific coalescencetime between populations in A and in B , that is, D (1) AB = min n D (1) rs : r ∈ A, s ∈ B o . Then, repeat as above until there is only one cluster left. This is essentially thealgorithm proposed by Rosenberg [Ros02]. In particular, Rosenberg calls the im-plied topology on the populations so obtained the collapsed gene tree .How to extend this algorithm to k > ? As we discussed earlier, one could infera gene tree as above for each locus and take a majority vote—but this approachfails [DR06]; in particular, it is generally not statistically consistent.4nother natural idea is to get a “better” estimate of coalescence times by av-eraging across loci. This leads to the Shallowest Divergence Clustering method ofMaddison and Knowles [MK06]. We argue that a better choice is, instead, to takethe minimum across loci. In other words, we apply the clustering algorithm aboveto the quantity D AB = min n D ( i ) AB : i ∈ I o , for all A, B ⊆ L with A ∩ B = ∅ . The reason we consider the minimum is similarto the case of one locus and several samples per population above: it suffices tohave one pair a ∈ M ( i ) r , b ∈ M ( i ) s (for some i ) with coalescence time T across all pairs of samples in populations r and s (one from each) and all loci in I toprovide indisputable evidence that the corresponding species branch before time T (looking backwards in time). In a sense, we build the “minimal” tree on L that is“consistent” with the evidence provided by the gene trees. This type of approachis briefly discussed by Takahata [Tak89] in the simple case of three populations(where the issues raised by [DR06] do not arise).The algorithm, which we name GLASS, is detailed in Figure 1. We call thetree so obtained the glass tree . We show in the next section that GLASS is in factstatistically consistent. Algorithm
GLASS
Input:
Gene trees {G ( i ) } i ∈I and coalescence times D ( i ) ab for all i ∈ I and a, b ∈ L ( i ) ; Output:
Estimated topology S ′ ; • [Intercluster coalescences] For all
A, B ⊆ L with A ∩ B = ∅ , compute D AB = min n D ( i ) ab : i ∈ I , r ∈ A, s ∈ B, a ∈ M ( i ) r , b ∈ M ( i ) s o ; • [Clustering] Set Q := {{ r } : r ∈ L } ; Until | Q | = 1 : – Denote the current partition Q = { A , . . . , A z } ; – Let A ′ , A ′′ minimize D AB over all pairs A, B ∈ Q (break ties arbitrarily); – Merge A ′ and A ′′ in Q ; • [Output] Return the topology implied by the steps above.
Figure 1: Algorithm GLASS.
Multilocus concordance.
A gene tree with one sample per population is said tobe concordant (sometimes also “congruent” or “consistent”) with a species tree if5heir (leaf-labelled) topologies agree. When the number of samples per popula-tion is larger than one, one cannot directly compare the topology of the gene treewith that of the species tree since they contain a different number of leaves. In-stead, Rosenberg [Ros02] defines a gene tree to be topologically concordant witha species tree if the collapsed gene tree (see above) coincides with the species tree.We extend Rosenberg’s definition to multiple loci. We say that a collectionof gene trees {G ( i ) } i ∈I is multilocus concordant with a species tree S if the glasstree agrees with the species tree. Therefore, to prove that GLASS is statisticallyconsistent, it suffices to show that the probability of multilocus concordance goesto as the number of loci goes to + ∞ . In this section, we state a simple combinatorial condition guaranteeing that GLASSreturns the correct species tree. Our condition is an extension of Takahata’s condi-tion in the case of a single gene [Tak89]. See also [Ros02].As before, let S be a species tree and {G ( i ) } i ∈ I a collection of gene trees. For asubset of leaves A ⊆ L , denote by h A i the most recent common ancestor (MRCA)of A in S . For a (internal or leaf) node v in S , we use the following notation: • ⌊ v ⌋ are the descendants of v in L ; • t v is the time elapsed in number of generations between v and ⌊ v ⌋ ; • t v is the time elapsed in number of generations between the immediate an-cestor of v and ⌊ v ⌋ .In particular, note that if e is the branch immediately above v , then we have t e = t v − t v . Also, we call the subtree below v , clade v .Our combinatorial condition can be stated as follows: ( ⋆ ) ∀ u, v ∈ V, t h⌊ u ⌋∪⌊ v ⌋i ≤ D ⌊ u ⌋⌊ v ⌋ < t h⌊ u ⌋∪⌊ v ⌋i . In words, for any two clades u , v , there is at least one locus i and one pair ofalleles a, b with a from clade u and b from clade v such that the lineages of a and b coalesce before the end of the branch above the MRCA of u and v . (Thefirst inequality is clear by construction.) By the next proposition, condition ( ⋆ ) is sufficient for multilocus concordance. Note, however, that it is not necessary.Nevertheless note that, by design, GLASS always returns a tree, even when thecondition is not satisfied. 6 roposition 1 (Sufficient Condition) Assume that ( ⋆ ) is satisfied. Then, GLASSreturns the correct species tree. In other words, the gene trees {G ( i ) } i ∈I are mul-tilocus concordant with the species tree S . Proof:
Let Q be one of the partitions obtained by GLASS along its execution andlet B be the newly created set in Q . We claim that, under ( ⋆ ) , it must be the casethat B = ⌊h B i⌋ . (1)That is, B is the set of leaves of a clade in the species tree S . The propositionfollows immediately from this claim.We prove the claim by induction on the execution time of the algorithm. Prop-erty (1) is trivially true initially. Assume the claim holds up to time T and let Q ,as above, be the partition at time T + 1 . Note that B is obtained by merging twosets B ′ and B ′′ forming a partition of B . By induction, B ′ and B ′′ satisfy (1).Now, suppose by contradiction that B does not satisfy (1). Let h B i ւ and h B i ց be the clades immediately below h B i with corresponding leaf sets C ′ = ⌊h B i ւ ⌋ and C ′′ = ⌊h B i ց ⌋ . By our induction hypothesis, each of B ′ and B ′′ must be con-tained in one of C ′ or C ′′ . Say B ′ ⊆ C ′ and B ′′ ⊆ C ′′ without loss of generality.Moreover, since B does not satisfy (1), one of the inclusions is strict, say B ′ ⊂ C ′ .But by ( ⋆ ) , any set X in Q containing an element of C ′ − B ′ has D B ′ X < t h B ′ ∪ X i ≤ t h B i ւ = t h B i = t h B ′ ∪ B ′′ i ≤ D B ′ B ′′ . (2)To justify the first two inequalities above, note that X is contained in the partitionat time T and therefore satisfies (1). In particular, by construction B ′ ∪ X ⊆ C ′ . Hence by (2), GLASS would not have merged B ′ and B ′′ , a contradiction. (cid:4) In this section, we prove the consistency of GLASS.
Consistency.
We prove the following consistency result. Note that the theoremholds for any species tree—including the “anomaly zone” of Degnan and Rosen-berg [DR06].
Proposition 2 (Consistency)
GLASS is statistically consistent. roof: Throughout the proof, time runs backwards as is conventional in coales-cent theory. We use Proposition 1 and give a lower bound on the probability thatcondition ( ⋆ ) is satisfied.Consider first the case of one locus and one sample per population. By ( ⋆ ) ,the reconstruction is correct if every time two populations meet, the correspond-ing alleles coalesce before the end of the branch immediately above. By classicalcoalescent calculations (e.g. [Tav84]), this happens with probability at least (1 − e − µ ) n − , where we used the fact that there are n − divergences.Now consider the general case. Imagine running the coalescent processes of allloci simultaneously . Consider any branching between two populations. In everygene tree separately, if several alleles emerge on either sides of the branching,choose arbitrarily one allele from each side. The probability that the chosen allelepairs fail to coalesce before the end of the branch above in all loci is at most e − kµ by independence. Indeed, irrespective of everything else going on, two alleles meetat exponential rate 1 (conditionally on the past). This finally gives a probability ofsuccess of at least (1 − e − kµ ) n − . For n and µ fixed, we get (1 − e − kµ ) n − → , as k → + ∞ , as desired. (cid:4) Rates.
Implicit in the proof of Proposition 2 is the following convergence rate.
Proposition 3 (Rate)
It holds that P [ Multilocus Discordance ] ≤ ( n − e − µ ] k . In particular, for any ε > , taking k = 1 µ ln (cid:18) n − ε (cid:19) , we get P [ Multilocus Discordance ] ≤ ε. Proof:
Note that − (1 − e − kµ ) n − ≤ ( n − e − µ ] k . (cid:4) ultiple alleles v. multiple loci. It is interesting to compare the relative effectsof adding more alleles or more loci on the accuracy of the reconstruction. Theresult in Proposition 3 does not address this question. In fact, it is hard to obtainuseful analytic expressions for small numbers of genes and alleles. However, theasymptotic behavior is quite clear. Indeed, as was pointed out in [Ros02] (seealso [MK06] for empirical evidence), the benefit of adding more alleles eventuallywears out. This is because the probability of observing any given number of allelesat the top of a branch is uniformly bounded in the number alleles existing at thebottom. More precisely, we have the following result which is to be contrastedwith Proposition 3.
Proposition 4 (Multiple Alleles: Saturation Effect)
Let S be any species tree on n populations. Then, there is a < q ∗ < (depending only on S ) such that forany number of loci k > and any number of alleles sampled per population, wehave P [ Multilocus Discordance ] ≥ ( q ∗ ) k > . In particular, for a fixed number of loci k > , as the number of alleles per pop-ulaiton goes to + ∞ , the probability that GLASS correctly reconstructs S remainsbounded away from . Proof:
Take any three populations a, b, c from S . Assume that a and b meet T generations back and that c joins them T generations later. For w = a, b, c and i ∈ I , let Y ( i ) w be the event that in locus i there is only one allele remaining at thetop of the branch immediately above w . Let Z ( i ) be the event that the topology ofgene tree i restricted to { a, b, c } is topologically discordant with S . It follows frombound (6.5) in [Tav84] that there is < q ′ < independent of h such that P [ Y ( i ) w ] ≥ q ′ , for all i ∈ I and w ∈ { a, b, c } . Also, it is clear that there is < q ′′ < dependingon T such that P [ Z ( i ) | Y ( i ) w , ∀ w ∈ { a, b, c } ] ≥ q ′′ , for all i ∈ I . Therefore, by independence of the loci, we have P [ Multilocus Discordance ] ≥ Y i ∈I P [ Z ( i ) | Y ( i ) w , ∀ w ∈ { a, b, c } ] Y w ∈{ a,b,c } P [ Y ( i ) w ] ≥ (( q ′ ) q ′′ ) k . Take q ∗ = ( q ′ ) q ′′ . That concludes the proof. (cid:4) Tolerance to Estimation Error
The results of the previous section are somewhat unrealistic in that they assumethat GLASS is given exact estimates of coalescence times. In this section, we relaxthis assumption.Assume that the input to the algorithm is now a set of estimated coalescencetimes (cid:26)(cid:16) b D ( i ) ab (cid:17) a,b ∈L ( i ) (cid:27) i ∈I , and, for all A, B ⊆ L , let b D AB = min n b D ( i ) ab : i ∈ I , r ∈ A, s ∈ B, a ∈ M ( i ) r , b ∈ M ( i ) s o , be the corresponding estimated intercluster coalescence times computed by GLASS.Assume further that there is a δ > such that (cid:12)(cid:12)(cid:12) b D ( i ) ab − D ( i ) ab (cid:12)(cid:12)(cid:12) ≤ δ, for all i ∈ I and a, b ∈ L ( i ) . In particular, note that (cid:12)(cid:12)(cid:12) b D AB − D AB (cid:12)(cid:12)(cid:12) ≤ δ, for all A, B ⊆ L .Let m be the shortest branch length in number of generations, that is, m = min { t e : e ∈ E } . We extend our combinatorial condition ( ⋆ ) to (ˆ ⋆ ) ∀ u, v ∈ V, t h⌊ u ⌋∪⌊ v ⌋i ≤ D ⌊ u ⌋⌊ v ⌋ < t h⌊ u ⌋∪⌊ v ⌋i − δ. Then, we get the following.
Proposition 5 (Sufficient Condition:Noisy Case)
Assume that δ < m , (3) and that (ˆ ⋆ ) is satisfied. Then, GLASS returns the correct species tree. roof: The proof follows immediately from the argument in Proposition 1 by not-ing that equation (2) becomes b D B ′ X ≤ D B ′ X + δ< t h B ′ ∪ X i − δ ≤ t h B ′ ∪ B ′′ i − δ ≤ D B ′ B ′′ − δ ≤ b D B ′ B ′′ . Condition (3) ensures that (ˆ ⋆ ) is satisfiable. (cid:4) Moreover, we have immediately:
Proposition 6 (Consistency & Rate: Noisy Case)
Assume that δ < m . Then GLASS is statistically consistent. Moreover, let
Λ = m − δm , then it holds that P [ Incorrect Reconstruction ] ≤ ( n − e − µ Λ ] k . In particular, for any ε > , taking k = 1 µ Λ ln (cid:18) n − ε (cid:19) , we get P [ Incorrect Reconstruction ] ≤ ε. The basic observation underlying our approach is that distances between popula-tions may be estimated correctly using the minimum divergence time among allindividuals and all genes.Actually, this observation may be used in conjunction with any distance-basedreconstruction algorithm. (See e.g. [Fel04, SS03] for background on distance ma-trix methods.) This can be done under very general assumptions as we discuss11ext. First, we do away with the molecular clock assumption. Indeed, it turns outthat D ( i ) ab need not be the divergence time between a and b for gene i . Instead, wetake D ( i ) ab to be the molecular distance between a and b in gene i , that is, the timeelapsed from the divergence point to a and b integrated against the rate of mutation.We require that the rate of mutation be the same for all genes and all individualsin the same branch of the species tree, but we allow rates to differ across branches.Below, all quantities of the type D , b D etc. are given in terms of this moleculardistance.For any two clusters A, B ⊆ L , we define b D AB = min n b D ( i ) ab : i ∈ I , r ∈ A, s ∈ B, a ∈ M ( i ) r , b ∈ M ( i ) s o , (4)as before. Let m ′ = min { t e ρ e : e ∈ E } , where ρ e is the rate of mutation on branch e . It is easy to generalize condition (ˆ ⋆ ) so that we can use (4) to estimate all molecular distances between pairs ofpopulations up to an additive error of, say, m ′ / . Then using standard four-pointmethods, we can reconstruct the species tree correctly.Note furthermore that by the results of [ESSW99], it suffices in fact to estimatedistances between pairs of populations that are “sufficiently close.” We can deriveconsistency conditions which guarantee the reconstruction of the correct speciestree in that case as well. References [DR06] J. H. Degnan and N. A. Rosenberg. Discordance of species trees withtheir most likely gene trees.
PLoS Genetics , 2(5), May 2006.[ELP07] Scott V. Edwards, Liang Liu, and Dennis K. Pearl. High-resolutionspecies trees without concatenation.
Proceedings of the NationalAcademy of Sciences , 104(14):5936–5941, 2007.[ESSW99] P. L. Erd¨os, M. A. Steel;, L. A. Sz´ekely, and T. A. Warnow. A fewlogs suffice to build (almost) all trees (part 1).
Random Struct. Algor. ,14(2):153–184, 1999.[Fel04] J. Felsenstein.
Inferring Phylogenies . Sinauer, New York, New York,2004. 12HM03] J. Hey and C. A. Machado. The study of structured populations–newhope for a difficult and divided science.
Nat. Rev. Genet. , 4(7):535–543, July 2003.[HSW05] Jotun Hein, Mikkel H. Schierup, and Carsten Wiuf.
Gene Genealogies,Variation and Evolution : A Primer in Coalescent Theory . OxfordUniversity Press, USA, February 2005.[KD07] L. S. Kubatko and J. H. Degnan. Inconsistency of phylogenetic esti-mates from concatenated data under coalescence.
Systematic Biology ,56(1):17–24, February 2007.[Kin82] J. F. C. Kingman. On the genealogy of large populations.
J. Appl.Probab. , (Special Vol. 19A):27–43, 1982. Essays in statistical science.[Kno04] L. L. Knowles. The burgeoning field of statistical phylogeography.
Journal of Evolutionary Biology , 17(1):1–10, 2004.[LP02] Knowles L. L. and Maddison W. P. Statistical phylogeography.
Mol.Ecol. , 11(12):2623–35, 2002.[Mad97] Wayne P. Maddison. Gene trees in species trees.
Systematic Biology ,46(3):523–536, 1997.[MK06] Wayne Maddison and L. Knowles. Inferring phylogeny despite in-complete lineage sorting.
Systematic Biology , 55(1):21–30, February2006.[Nic01] R Nichols. Gene trees and species trees are not the same.
Trends Ecol.Evol. , 16(7):358–364, July 2001.[Nor01] M. Nordborg. Coalescent theory. In D. J. Baldingand M. J. Bishop andC. Cannings, editors,
Handbook of Statistical Genetics , pages 179–212. John Wiley & Sons, Inc., Chichester, U.K., 2001.[PN88] P. Pamilo and M. Nei. Relationships between gene trees and speciestrees.
Mol. Biol. Evol. , 5(5):568–583, September 1988.[Ros02] N. A. Rosenberg. The probability of topological concordance of genetrees and species trees.
Theor. Popul. Biol. , 61(2):225–247, March2002.[SR07] M. Steel and Allen Rodrigo. Maximum likelihood supertrees. Preprint,2007. 13SS03] C. Semple and M. Steel.
Phylogenetics , volume 22 of
Mathematicsand its Applications series . Oxford University Press, 2003.[Tak89] N. Takahata. Gene genealogy in three related population: Consistencyprobability between gene and population trees.
Genetics , 122:957–966, 1989.[Tav84] S. Tavar´e. Line-of-descent and genealogical processes, and their appli-cations in population genetics models.
Theor. Popul. Biol. , 26(2):119–164, October 1984.[Tav04] Simon Tavar´e. Ancestral inference in population genetics. In
Lectureson probability theory and statistics , volume 1837 of