[PDF] Dated ancestral trees from binary trait data and its application to the diversification of languages

Abstract

Binary trait data record the presence or absence of distinguishing traits in individuals. We treat the problem of estimating ancestral trees with time depth from binary trait data. Simple analysis of such data is problematic. Each homology class of traits has a unique birth event on the tree, and the birth event of a trait visible at the leaves is biased towards the leaves. We propose a model-based analysis of such data, and present an MCMC algorithm that can sample from the resulting posterior distribution. Our model is based on using a birth-death process for the evolution of the elements of sets of traits. Our analysis correctly accounts for the removal of singleton traits, which are commonly discarded in real data sets. We illustrate Bayesian inference for two binary-trait data sets which arise in historical linguistics. The Bayesian approach allows for the incorporation of information from ancestral languages. The marginal prior distribution of the root time is uniform. We present a thorough analysis of the robustness of our results to model mispecification, through analysis of predictive distributions for external data, and fitting data simulated under alternative observation models. The reconstructed ages of tree nodes are relatively robust, whilst posterior probabilities for topology are not reliable.

Full PDF

aa r X i v : . [ s t a t . M E ] N ov Dated ancestral trees from binary trait data and its ap-plication to the diversiﬁcation of languages

Geoff K. Nicholls

Department of Statistics, Oxford, UK.

Russell D. Gray

Department of Psychology, Auckland University, Auckland, New Zealand.

Summary.

Binary trait data record the presence or absence of distinguishing traits in individ-uals. We treat the problem of estimating ancestral trees with time depth from binary trait data.Simple analysis of such data is problematic. Each homology class of traits has a unique birthevent on the tree, and the birth event of a trait visible at the leaves is biased towards the leaves.We propose a model-based analysis of such data, and present an MCMC algorithm that cansample from the resulting posterior distribution. Our model is based on using a birth-deathprocess for the evolution of the elements of sets of traits. Our analysis correctly accountsfor the removal of singleton traits, which are commonly discarded in real data sets. We illus-trate Bayesian inference for two binary-trait data sets which arise in historical linguistics. TheBayesian approach allows for the incorporation of information from ancestral languages. Themarginal prior distribution of the root time is uniform. We present a thorough analysis of therobustness of our results to model mispeciﬁcation, through analysis of predictive distributionsfor external data, and ﬁtting data simulated under alternative observation models. The recon-structed ages of tree nodes are relatively robust, whilst posterior probabilities for topology arenot reliable.

Keywords:

Phylogenetics, binary trait, dating methods, Bayesian inference, Markov chainMonte Carlo, glottochronology

1. Introduction

A great deal of progress has been made on the statistical analysis of DNA sequence data, andin particular for model-based estimation of genealogy. No equivalent statistical frameworkexists for trait-based cladistics. However, qualitative and quantitative trait data may beused to recover dated tree-like histories in situations where we have no genetic sequence data.Progress is possible when the traits are similar in type, so that some unifying assumptionabout their evolution is justiﬁed.We give statistical methodology for tree-estimation from binary trait data. These dataare made up of binary sequences, each sequence recording for one taxon the presence orabsence of a list of traits. Pigeon wings and sparrow wings are instances of the trait “birdwings” displayed at the taxa “Pigeon” and “Sparrow”. In our model, two instances ofa trait are necessarily homologous , that is, they descend from a common ancestor. Traitobservation models have many missing data. Birth times of observed traits are unknown,

Address for correspondence:

GK Nicholls, Department of Statistics, 1 South Parks Road, OxfordOX1 3TG, UK

E-mail: [email protected] and traits displayed at less than two taxa may be discarded. We model these missing data,integrate them out of the analysis analytically, and measure the random error using samplebased Bayesian inference.The outline of this paper is as follows. In the ﬁrst half (Sections 2 to 6) we set up Bayesianinference for a class of trait models. We give observation models, likelihood evaluation, prior,posterior and a MCMC scheme. In the second half of the paper (Sections 7 to 9) we applythe inference scheme to two closely related data sets. We review previous studies of thesedata and describe the particular models we ﬁt, then present results and model mispeciﬁca-tion analysis. Readers interested in the application only should read the ﬁrst paragraph of2, and all of Section 4, before jumping to the data analysis in Sections 7 and 8. Graphics il-lustrating this application make up the bulk of the supplement Nicholls and Gray (2007), see .Supplement section labels correspond to the section labels in this paper.We begin in Section 2 with the observation process. In Section 3 we give an eﬃcientscheme for evaluating the corresponding likelihood. The model, described in Huson and Steel(2004) and Atkinson et al. (2005), is the natural stochastic process representing Dollo’s par-simony criterion, since each instance of a given trait descends from a single innovation. InHuson and Steel (2004) the traits are distinct genes which are present or absent in an indi-vidual, and trees are built using a maximum-likelihood pairwise-distance, and the neighbor-joining methods of Saitou and Nei (1987). Our own work has been motivated by a pair ofdata-sets, Dyen et al. (1997) and Ringe et al. (2002), recording trait presence and absencefor Indo-European languages. Here traits are cognate classes (also called lexical traits ), thatis, homology classes of words of closely similar meaning. Thus English, Flemish and Danishshare the trait “all/alle/al” whilst Spanish, Catalan and Italian lack that trait, but share“todo/tot/tutto”.In Section 4 we write down two prior probability distributions for trees. The ﬁrst imposesan exponential penalty on branch length. The second is designed to be non-informative withrespect to the height of the root node, and otherwise uniform on the space of trees. Thisdistribution is a new class of tree priors and is likely to be useful in a broader phylogeneticsetting. General calibration constraints are introduced in Section 4 with speciﬁc examplesin Section 7.2. These constraints, derived from historical records, bound some tree nodeages above and below and are used to ﬁx trait birth and death rates. They determinethe parameter space of trees. In Section 5 we gather together the results of Section 3 andSection 4 and write down the full posterior density for the model parameters. We giveclosed form results for the two leaf case, and verify that we reproduce the distance measureof Huson and Steel (2004). We give a brief description of our MCMC algorithm for samplingthe posterior density in Section 6.In Section 7 we introduce the cognate data and the associated calibration constraintdata and summarize previous work. Section 8 has two parts. Section 8.1 gives furtherdetails of the model we ﬁt and Section 8.2 summarizes results and conclusions.Statistical contributions to the dating of language branching events have been rejectedby linguists. Dating eﬀorts are criticized for their assumption of a constant rate of languagechange at all times and in all places, the so called “glottal clock”. Bergsland and Vogt(1962) found examples of extreme rates, but employed counter-examples biased by data-selection. Blust (2000) links rate heterogeneity to long-branch attraction. However, neithercriticsm considers even the random component of the error. In this respect we are repeatingthe comments of Sankoﬀ (1973). In Section 9 we investigate model errors. Although we ﬁndevidence for model mispeciﬁcation we nevertheless reproduce, to within random error, age ncestral trees for binary trait data estimates in analyses across near-independent data, and in reconstructions from syntheticdata simulated under likely model-violation scenarios.

2. A model of binary trait evolution

In this section we specify an observation model for traits evolving on a ﬁxed tree. We prefacethis section with a qualitative description of the model. An individual is represented by a setof traits which are distinguishable and non-interacting. As calendar time passes, new traitsare born into the set at a constant rate. Each such birth generates a new homology class oftrait instances. At a tree node, two identical copies of the set entering the node leave thatnode. Two instances of each trait entering a node leave the node, one instance in each set.Each instance of each trait in each set dies independently at a constant rate. A graphicalillustration of the process and notation is given in the supplement, Nicholls and Gray (2007).Models of the tree itself are given in Section 4.We begin our formal description with notation for the tree. Let g = ( E, V, t ) be a rootedbinary tree with L leaves plus one extra node (label R ∗ ) ancestral to the root itself. The setof all nodes is then V = { , , . . . , L } , with leaf nodes V L , ancestral nodes V A and edges h i, j i ∈ E where i, j ∈ V and i < j . Node ages t = ( t , t , . . . , t L ) are ordered t i ≤ t i +1 and increase from the leaves to the root. The root node label is R = 2 L − R ∗ = 2 L with age t R ∗ = ∞ which is connected to the root via an edge h R, R ∗ i ∈ E . Our convention is R ∗ V A , so V = { R ∗ } ∪ V A ∪ V L . Leaves may be staggeredin time.Next we describe the evolution of traits. Sets of trait instances are sets of trait labelswhich evolve along the branches of g from the root towards the leaves, in the direction ofdecreasing age. Identify edge h i, j i by the node i at the base of that edge (edges are directedfor increasing age, from leaf to root, in the opposite direction to the calendar evolution oftraits, from root to leaf). For h i, j i ∈ E and τ ∈ [ t i , t j ) denote by ( τ, i ) a time point on abranch of g and by [ g ] = [ h i,j i∈ E [ τ ∈ [ t i ,t j ) { ( τ, i ) } , the set of all such points, including points on the edge h R, R ∗ i of inﬁnite length. For eachbranch h i, j i ∈ E deﬁne a set-valued process H ( τ, i ) = { h , h , . . . , h N ( τ,i ) } of trait labels h a ∈ Z , a = 1 , , . . . , N ( τ, i ). The elements of H ( τ, i ) are realized by a simple reversiblebirth-death process which acts along each edge of the tree. Set elements are born at constantrate λ . The label for each new born trait is unique to that trait, but otherwise arbitrary.Set elements die at constant per capita rate µ . At a branching event ( t i , i ) ∈ [ g ], set H ( t i , i )is copied onto the top of the two branches h j, i i and h k, i i emerging from i , so that theevolution of H ( τ, j ) and H ( τ, k ) is conditional on H ( t i , j ) = H ( t i , k ) = H ( t i , i ).The number of elements N ( t R , R ) in the root set is the number of traits born in[ t R , ∞ ) which survive to time t R . These surviving traits are generated at rate λ ( τ, R ) = λ exp( − µ ( τ − t R )), so their number is Poisson, mean λ/µ . The process may therefore beinitialized by simulating N ( t R , R ) ∼ Π( λ/µ ), and assigning N ( t R , R ) arbitrary trait labels H ( t R , R ) = { , , . . . , N ( t R , R ) } to the root set.The data D ( L ) = ( H i , i ∈ V L ), where H i = H ( t i , i ) , i ∈ V L (1) are an ordered list of the sets of trait labels observed at the tree leaves. Suppose that inall there are N distinct trait labels C = ( c , c , . . . , c N ) (so, C = ( ∪ i ∈ V L H i )) displayed atthe leaves. We can represent the data as N sets of taxa labels also, with set M a giving theleaves at which trait c a appears. This representation is D ( N ) = ( M , M , . . . , M N ), with M a = { i : c a ∈ H i , i ∈ V L } for a = 1 , . . . , N .Traits displayed at just one taxon are often dropped from the data. It is argued thatthese singleton traits do not inform tree topology. This is not the case in the model wehave described, since singleton traits are informative of time depth. Referring to the dataanalyzed in Section 7, Gray and Atkinson (2003) drop singleton traits in their binary reg-istration of the Dyen et al. (1997) data-set but retain them in their registration of theRinge et al. (2002) data. Let I c ∈ H ( t j ,j ) = 1 if c ∈ H ( t j , j ) and zero otherwise. The thinneddata is D ( L ) = ( H i , i ∈ V L ), with H i =  c ∈ H ( t i , i ) : X j ∈ V L I c ∈ H ( t j ,j ) >  , i ∈ V L . (2)We call this observation model, which drops singleton traits, NOUNIQUE, in contrast tothe model NOABSENT deﬁned by Equation (1). We write D for generic NOABSENT orNOUNIQUE data.Felsenstein (1992) gives the likelihood for a Poisson process acting on a ﬁnite state space,along the branches of a tree, conditioned to show states other than the zero state at theleaves. Lewis (2001) proposes applying certain trait models of this kind (so-called Jukes-Cantor models) to morphological character data, in a maximum likelihood analysis. Lewis(2001) mentions the problem of thinning traits displayed at a single taxon, and treats itby ensuring the data are not so thinned. Nylander et al. (2004) ﬁt models from the samefamily, allowing for the thinning of all parsimony uninformative characters (traits displayedat 0, 1, L − L leaves). These models do not constrain a trait to be generated at a singlebirth event. The authors model a ﬁxed number of traits which move back and forwardbetween diﬀerent categorical values indeﬁnitely. The number of distinct traits is ﬁxed forall time. We impose a single birth event for a trait and an evolution which proceeds fromabsence to presence to absence only. The number of distinct traits generated by our processis random, so that the total number is informative of the relative rates of birth and death.The model we have described resembles the Watterson (1975) inﬁnite sites model, but heretrait-death is in eﬀect back-mutation. Our model is similar to the inﬁnite alleles model ofKimura and Crow (1964), though the number of alleles is not random, whilst the numberof traits is random.

3. Likelihood calculations

The likelihood for g, µ and λ is given in terms of the distribution of the point process ofbirth points for those traits displayed in the data. Let X = { X , X , . . . , X N } be a randomset of trait birth-points in [ g ]. The Poisson process generating X is obtained by thinningrealizations of a constant rate process. Suppose a trait with label c is born at z ∈ [ g ];let O ( z ) = P i ∈ V L I c ∈ H i give the number of taxa displaying trait c (after any thinning). IfPr { O ( z ) > d | z, g, µ } is the probability for a trait, born at z ∈ [ g ] to appear in the data at d +1or more leaves, then the trait birth-rate at z in process X is λ ( z ) = λ Pr { O ( z ) > d | z, g, µ } ,where d = 0 under the NOABSENT observation model and d = 1 under NOUNIQUE. ncestral trees for binary trait data The distribution of X is deﬁned on the space X of all ﬁnite subsets x ⊂ [ g ]. For f : [ g ] → ℜ , deﬁne the integral R [ g ] f ( z ) dz along tree branches by Z [ g ] f ( z ) dz = X h i,j i∈ E Z t j t i f (( τ, i )) dτ. Now, suppose X = x with x = { x , x , . . . , x N } and x a = ( τ a , i a ) for a = 1 , . . . , N , so that x a ∈ [ g ] identiﬁes the point on the tree where trait c a was born. Let dx a = dz at x a = z .The density of the random set X = x , with respect to dx = dx dx . . . dx N on X , is f X ( x | g, µ, λ ) = exp − Z [ g ] λ ( z ) dz ! N Y a =1 λ ( x a ) . The total number of distinct traits in the data N ∼ Π (cid:16)R [ g ] λ ( z ) dz (cid:17) has a Poisson distribu-tion.Trait birth points are nuisance parameters, which we integrate out of the likelihoodunder the density f X . Denote by Pr { M a = m a | x a , g, µ, O ( x a ) > d } the probability for atrait, born at x a , to be displayed at the leaves listed in set m a and no others, conditionalon being displayed in at least d leaves. The likelihood, P ( D | g, µ, λ ), is P ( D | g, µ, λ ) = Z X P ( D | x, g, µ ) f X ( x | g, µ, λ ) dx = e − R [ g ] λ ( z ) dz N ! N Y a =1 λ Z [ g ] Pr { M a = m a | x a , g, µ, O ( x a ) > d } Pr { O ( x a ) > d | x a , g, µ } dx a . The outcome { M a = m a , O ( x a ) > d } is identical to the outcome { M a = m a } for traitsin the data, since those traits already satisfy the thinning condition card m a > d for each a = 1 , , . . . , N . It follows that events in the data satisfyPr { M a = m a | x a , g, µ, O ( x a ) > d } = Pr { M a = m a | x a , g, µ } Pr { O ( x a ) > d | x a , g, µ } , and consequently the likelihood is P ( D | g, µ, λ ) = 1 N ! exp − Z [ g ] λ ( z ) dz ! N Y a =1 λ Z [ g ] Pr { M a = m a | x a , g, µ } dx a . (3)We compute λ R [ g ] Pr { O ( z ) > d | z, g, µ } dz and the factors λ R [ g ] Pr { M a = m a | x a , g, µ } dx a using recursions related to the pruning recursion of Felsenstein (1981). We begin with λ R [ g ] Pr { O ( z ) > d | z, g, µ } dz . A birth at a generic point ( τ, i ) can be shifted to the childnode, ( t i , i ),Pr { O ( τ, i ) > d | ( τ, i ) , g, µ } = Pr { O ( t i , i ) > d | ( t i , i ) , g, µ } exp( − µ ( τ − t i )) , and the integral over [ g ] reduced to a sum over contributions from edges: λ Z [ g ] Pr { O ( z ) > d | z, g, µ } dz = λµ X h i,j i∈ E Pr { O ( t i , i ) > d | ( t i , i ) , g, µ } (1 − e − µ ( t j − t i ) ) . (4) We are interested in the cases d = 0 and d = 1. Let u ( d ) i ≡ Pr { O ( t i , i ) = d | ( t i , i ) , g, µ } soPr { O ( t i , i ) > d | ( t i , i ) , g, µ } = ( − u (0) i d = 0 , − u (0) i − u (1) i d = 1 . We give recursions for the u ( d ) i . Consider a pair of edges h j, i i , h k, i i in E . Let δ i,j = e − µ ( t i − t j ) . The recursions u (0) i = (cid:16) (1 − δ i,j ) + δ i,j u (0) j (cid:17) (cid:16) (1 − δ i,k ) + δ i,k u (0) k (cid:17) u (1) i = δ i,j (1 − δ i,k ) u (1) j + δ i,k (1 − δ i,j ) u (1) k + δ i,j δ i,k ( u (1) j u (0) k + u (0) j u (1) k ) (5)are evaluated from u (0) i = 0 and u (1) i = 1 at leaves i ∈ V L .We need to compute λ R [ g ] Pr { M a = m a | x a , g, µ } dx a for generic trait patterns. Trait c a is born into an edge ancestral to all the leaf nodes which display it, so the edges of g which contribute to the integral dx a are those edges, E a say, on the path to node R ∗ from the most recent common ancestor of the leaf nodes in m a . Also, m a is non-empty, soPr { M a = m a | ( τ, i ) , g, µ } = Pr { M a = m a | ( t i , i ) , g, µ } exp( − µ ( τ − t i )). We write the integralover [ g ] in terms of a sum over contributions from edges: λ Z [ g ] Pr { M a = m a | x a , g, µ } dx a = λµ X h i,j i∈ E a Pr { M a = m a | ( t i , i ) , g, µ } (1 − e − µ ( t j − t i ) ) . (6)Let V ( i ) L be the set of leaf nodes in V descended from node i , including i if node i is a leaf.For leaf sets m a let m ( i ) a = V ( i ) L ∩ m a . Consider two edges h j, i i , h k, i i in E . Events areindependent down the two branches,Pr { M ( i ) a = m ( i ) a | ( t i , i ) , g, µ } = Pr { M ( j ) a = m ( j ) a | ( t i , j ) , g, µ } Pr { M ( k ) a = m ( k ) a | ( t i , k ) , g, µ } , and moving from the top ( t i , j ) to the bottom ( t j , j ) of branch h j, i i ,Pr { M ( j ) a = m ( j ) a | ( t i , j ) , g, µ } = ( δ i,j × Pr { M ( j ) a = m ( j ) a | ( t j , j ) , g, µ } if m ( j ) a = ∅ ,(1 − δ i,j ) + δ i,j u (0) j if m ( j ) a = ∅ . (7)The recursion is evaluated from the leaves,Pr { M ( j ) a = m ( j ) a | ( t j , j ) , g, µ } = ( j is a leaf and m ( j ) a = { j } ,0 if j is a leaf and m ( j ) a = ∅ .The recursion need not reach the leaves. It can be evaluated from nodes j satisfying m ( j ) a = ∅ , using Equation (7) since u (0) j is computed for the R [ g ] λ ( z ) dz evaluation.

4. Prior models on trees

In this section we specify two families of probability distributions over trees, which we useto represent prior information concerning the phylogeny.One tree prior we use is a branching process G L with rate θ stopped at the instant ofthe L th branching event (counting the branching at the root). Denote by Γ the space of ncestral trees for binary trait data G L -realizable trees and by dg the measure Q i ∈ V A dt i , with counting measure on topologies.The process G L determines a density f G ( g | θ ) ∝ θ L − exp( − θ | g | )with respect to dg , where | g | is the sum of all branch lengths, excluding the branch h R, R ∗ i .The same functional form of the density is used when tree leaves are oﬀset in time.In Section 7, a hypothesis of the form “ t R ∈ [ t min , t max ]” is central. This motivates aprior which is non-informative with respect to such hypotheses. One prior which is stronglyinformative for t R is the prior density f ( g | T ) ∝ I t R ≤ T , the uniform distribution over alltrees in Γ with root age smaller than T , a ﬁxed upper limit. We ﬁnd that, for trees withisochronous leaves at t = t = . . . = t L = 0, the marginal distribution of t R is t L − R (foreach g ∈ Γ the topology-constrained volume integral R dt L +1 . . . dt L − ∝ t RL − ). Thisprior represents a state of belief in which Pr { t R ∈ [ T / , T ] } is about 2 L − times greaterthan Pr { t R ∈ [0 , T / } . The marginal density of t R in the prior f R ( g | T ) ∝ t − LR I t R ≤ T is uniform in [0 , T ].In Section 7.2, certain groups of taxa, called clades, are known to group together on thetree. Upper and lower bounds on the age of their common ancestor are used to calibraterate parameters. Admissible trees g ∈ Γ ′ , Γ ′ ⊆ Γ, satisfy these prior calibration constraints.Where calibration constraints are imposed, the prior f R must be modiﬁed, in order tomaintain a uniform prior distribution for the root age. Nodes in clades with clade roottimes bounded above by calibration constraints do not contribute a factor t R to the tree-topology constrained volume integral R Q i ∈ V A \{ R } dt i . The density f R must be furthermodiﬁed to take into account non-isochronous leaf dates. The exact result is beyond us.However, if S is a list of free nodes, ie nodes i ∈ V A \ { R } outside or above root-boundedclades, and for node i ∈ S in tree g ∈ Γ ′ , s i is the minimum time-value node i can achievein any admissible tree, then f R ( g | T ) ∝ I t R

5. Posterior distributions

Our ﬁnal expression for the likelihood is obtained by substituting Equation (4) and Equa-tion (6) into Equation (3), and evaluating these terms using Equation (5) and Equation (7) respectively. Multiplying that likelihood by the tree-prior f G given in Section 4 and a priordensity p ( µ, λ, θ ) for our rate parameters, we obtain the posterior distribution p ( g, µ, λ, θ | D ) dθdλdµdg ∝ exp  − λµ X h i,j i∈ E Pr { O ( t i , i ) > d | ( t i , i ) , g, µ } (1 − e − µ ( t j − t i ) )  × N Y a =1 X h i,j i∈ E a Pr { M a = m a | ( t i , i ) , g, µ } (1 − e − µ ( t j − t i ) ) × (cid:18) λµ (cid:19) N θ L − e − θ | g | p ( µ, λ, θ ) dθdλdµ Y i ∈ V A dt i . (8)Equation (8) holds for tree prior f G . Under tree prior f R ( g | T ) we drop parameter θ fromthe posterior and replace f G ∝ θ L − e − θ | g | with f R ∝ t − LR I t R t , . . . , t R , µ, λ, θ ) → ( t /ρ, . . . , t R /ρ, µρ, λρ, θρ ) leaves p ( g, µ, λ, θ | D ) dgdλdµdθ invariant, so it cannot be a proper distribution. The problem remains (for ρ > t R /T ) undertree prior f R . Date calibration data described in Section 4 and Section 7.2 restricts thespace of tree states from Γ to Γ ′ , and thereby breaks the time-rescaling invariance. Theposterior becomes proper.The special case of two taxa (so, 0 = t ≤ t ≤ t with t R = t the tree height) isof interest for checking and debugging. The data, D ( L ) = ( H , H ), are two lists of traitinstances, including traits present at just one leaf. Huson and Steel (2004) compute theMLE, | g | ∗ , for the total tree length | g | = 2 t R − t − t in a two leaf tree directly, using thereversibility of the birth-death process of traits between the two leaves, and conditioned on λ/µ known. In this way they motivate a new measure of the distance between two binarysequences as the pairwise maximum likelihood distance between the two sequences. Let n = card H \ H , n = card H \ H and n = card ( H ∩ H ), so that N = n + n + n .The data D amounts to n , n , n in the two leaf case. The likelihood for the two leaf case,computed from Equation (3), using Equation (4) and Equation (6), is P ( n , n , n || g | , λ, µ ) ∝ (cid:18) λµ (cid:19) N exp (cid:18) − λµ h − e − µ | g | i(cid:19) (cid:16) − e − µ | g | (cid:17) n + n e − µ | g | n . Maximizing this expression over | g | given λ/µ we recover the branch length calculated inHuson and Steel (2004). If instead we maximize P ( n , n , n || g | , λ, µ ) over λ and | g | ≥ t − t we get an estimate | g | ∗ for the time separation of two taxa, | g | ∗ = 1 µ log (cid:18) n + n n (cid:19) . (9)Swadesh (1952) ﬁts a relation of this kind to lexical trait data.The posterior distribution for | g | given µ , which is available in closed form for the two leaftree, is useful for debugging MCMC code. Taking priors p ( λ, θ ) = ( λθ ) − in Equation (8)and integrating out λ and θ , we obtain, p ( | g || µ, n , n , n ) ∝ µ | g | (cid:20) e − µ | g | − e − µ | g | (cid:21) n (cid:20) − e − µ | g | − e − µ | g | (cid:21) n + n . (10)Here µ and | g | appear in the combination µ | g | . When we consider large trees, and estimate µ , calibration constraints ﬁxing the age of clades in g separate this pair of variables. ncestral trees for binary trait data Table 1.

A miniature lexical “data set” with L = 3 languages, K = 3 meanings and N = 6 distinct traits, C = { , , . . . , } , from Dyen et al. (1997). “to give” “big” “we”Flemish geven groot wyDanish give stor viKashmiri dyunu bodu asi = ⇒ k = 1 k = 2 k = 3 i = 1 c = 1 c = 3 c = 6 i = 2 c = 1 c = 4 c = 6 i = 3 c = 2 c = 5 c = 6

6. Markov chain Monte Carlo

We work exclusively with the marginal posterior density p ( g, µ | D ). When the prior for λ is λ − , this variable is Gamma distributed in the posterior, and may be integrated. The sameobservation applies to θ , when we use the f G prior. Sampling the posterior distribution p ( g, µ | D ) via Metropolis-Hastings Markov chain Monte Carlo is fairly straightforward, onceeﬃcient schemes for evaluating and updating the recursions, Equations (5) and (7) havebeen implemented.We use the tree operations described in Drummond et al. (2002). These include up-dates which alter the tree topology, updates which vary node times, updates which varyparameters such as µ , and updates which make some combination of these changes. Ina specimen update we generate candidates for Metropolis-Hastings updates by simulating ρ ∼ U (1 / ,

2) and setting t ′ = ρt and µ ′ = µ/ρ , since this is expected to be a ridge directionof the loglikelihood. In the acceptance probability for this update, the probability densityto generate the reverse update, with ρ ′ = 1 /ρ , is equal to the probability density to gener-ate the forward update, and a Jacobian term | ∂ ( g ′ , µ ′ , ρ ′ ) /∂ ( g, µ, ρ ) | = ρ L − appears in theHastings ratio. The calibration constraints ﬁx certain taxa groupings as clades, and boundthe age of the most recent common ancestor of certain clades of taxa. These constraintsare implemented by rejecting proposed states that violate the constraints.Our MCMC convergence analysis, based on monitoring the asymptotic behavior of theautocorrelation for µ , t R , and the log-prior and log-likelihood, follows Geyer (1992). Wemade a number of checks on our implementation. We check that the computer functionfor the likelihood Equation (3) sums to one over data. We check that the marginal priordistribution of t R under f R with isochronous leaves is uniform. We recover the posteriordistribution in Equation (10) in the two leaf case. We ﬁx a data set and vary the proportionsin which update types are used. We check that statistics computed under the posterior donot vary, to within estimated errors. We recover the parameters of synthetic data, and theposterior distribution concentrates on the correct parameter values as the number, N , oftraits displayed in the data increases.

7. Data

In the Dyen et al. (1997) and Ringe et al. (2002) data, a trait is a homology class of words.The setup is illustrated in Table 1. A set of K meaning categories are chosen and, for eachof the L languages in the study, words in the K meaning categories are gathered. TheDyen et al. (1997) data uses the Swadesh (1952) “word list” (in fact a list of meanings). Inthis list, K = 200 core meaning categories (“All”, “And”, “Animal”,. . . ) are given. Words in the Swadesh meaning categories are relatively resistant to lateral trait transfer, referredto here as borrowing . Embleton (1986) observes that words borrowed from French and Latinmake up about 60% of the English lexicon, but less than 6% of the Swadesh 200-word list.The Ringe et al. (2002) data we have uses a list of K = 328 meanings, (plus morphologicaltraits, which we do not treat). There is a Swadesh list of K = 100 meaning categoriesthought to be particularly resistant to borrowing. The word lists are nested, so both datasets include the 200-word and 100-word lists.In the following, trait data collected by Gray and Atkinson (2003) for Hittite, TocharianA and Tocharian B are analysed with 84 languages (displayed in Fig. 3) from the Dyen et al.(1997) data. These merged data are referred to hereafter as the Dyen et al. (1997) data. Ofthe L = 24 languages in the Ringe et al. (2002) data (displayed in Fig. 4), 20 are ancient.In contrast, of the L = 87 languages in the Dyen et al. (1997) data, just the three addedby Gray and Atkinson (2003) are ancient. The two data sets are substantially independent.Both data sets are available in electronic format.The linguist identiﬁes homology classes among the words in a given meaning category.In order to avoid false identiﬁcation of homology, where there is merely a chance likenessof sound, linguists require close correspondence of meaning. Where words are judged tobe descended from a common ancestor they are assigned the same trait label. This oper-ation, which requires expert knowledge, is equivalent to replacing words with trait labels, c ∈ C , and thereby generating for each language i = 1 , , . . . , L and each meaning category k = 1 , , . . . , K a trait set H ( k ) i . In the context of this application, homology classes of traitsare called cognate classes . Both data sets mark some cognate classes as equivocal, and oﬀer“splitting” and “lumping” versions of the data. We present results for the “splitting” datawhich assigns separate labels to cognate classes which may in fact display a single homolo-gous trait. Results for the lumping data are very similar. We comment on this systematicerror in Section 9.6. Gray and Atkinson (2003) register the “splitting” Dyen et al. (1997)and Ringe et al. (2002) cognate data respectively as 87 × × H = { } , H = { } , H = { } ,. . . , H = { } . Looking at Section 2, we have an extra superscript ( k ) on trait-sets H i marking themeaning class. In Section 8 we start with one independent copy of the trait birth-deathprocess H ( τ, i ) for each meaning category.The vocabularies of some ancient languages are only partially reconstructed, creatinggaps in the binary sequence data. The Ringe et al. (2002) data marks these gaps. We areunable to treat missing data at this stage. We are obliged to drop from the analysis ofthe Ringe et al. (2002) data the languages Gothic, Lycian, Luvian, Oscan, Umbrian, OldPrussian, Old Persian, Avestan and Tocharian A, leaving the languages in Fig. 4. We retainsome languages with small numbers of gaps, simply marking the gap as trait-absence. Wediscuss the associated model mis-speciﬁcation bias in Section 9.7. The number of gaps inour registration of the Dyen et al. (1997) data is negligible. Historical sources provide rate calibration data for these Indo-European data sets. Atkinson et al.(2005) compile calibration points. For example, the Brythonic languages

Welsh _ N, Welsh _ C,Breton _ List, Breton _ SE and Breton _ ST form a clade in the Dyen et al. (1997) data, witha common ancestor between 1450 and 1600 years before the present (BP, where the presentis the year 2000 - only roughly the time the data was gathered, because the dating accu-racy is in any case low). In our analysis of the two data sets we imposed 16 groups of ncestral trees for binary trait data taxa as clades: Brythonic, Celtic, Italic, Iberian-French, Germanic, West Germanic, NorthGermanic, Balto-Slav, Slav, Indic, Indo-Iranian, Iranian, Albanian, Greek, Armenian andTocharic.The calibration points marked by horizontal bars in the sample tree states below giveboth lower and upper bounds on clade root times. Each such calibration point gives anindependent estimate for λ , µ and θ . Prior knowledge providing only a lower bound onlanguage branching (“languages A and B were distinct by year C”) is more common, but lessvaluable, as it does not break the scale invariance discussed in Section 5. Yang and Rannala(2006) observe that, in a phylogenetic setting, there is often good evidence for the lowerbound, but little conﬁdence in the upper bound. The same applies here, since the upperbound for a split is supported by the absence of historical evidence for separate vocabularies.However, uncertainties in the positions of the upper bounds are of the order of hundreds ofyears, whilst the evolution rates we are calibrating give half lives of thousands of years, sowe have not pursued this source of uncertainty. We do see one result which suggests thata soft upper limit would have improved the analysis. The one incorrect clade age estimate(Balto-Slav) we see in the cross-validation study (Supplement, Section 9.4) fell above theupper limit of the calibration interval. That upper limit is diﬀerent from the others, sinceit is not based upon historical texts, but is instead derived from consideration of excavatedcultural remains. The link between language and excavated culture is obviously weakerthan language and literature. The survey given in Sankoﬀ (1973) summarizes models of cognate trait data. Sankoﬀ(1973) presents relatively realistic models which are complex and parameter-rich. Sankoﬀ(1973) discusses inference based on pairwise distances between the binary-trait data-vectorsof two languages. This mode of inference has been the norm for lexical cognate classdata. Thus Dyen et al. (1992) use classical hierarchical clustering of data-vectors basedon pairwise distances between languages to establish a tree of languages. In contrast,Gray and Atkinson (2003) use Ronquist and Huelsenbeck (2003) MrBayes software and theBayesian phylogenetic methods of Yang and Rannala (1997), to ﬁt the ﬁnite-sites DNAsequence model of Felsenstein (1981). The MrBayes software allowed them to account forthe thinning of traits surviving into zero taxa. Pagel and Meade (2006) describe and ﬁta related, more realistic, model of cognate replacement within meaning category. Thesemodels allow traits identiﬁed in the data as homologous to arise by independent innovation.Warnow et al. (2006) propose a model in which each homology class has a unique birthevent. However, there is to date no statistical inference for the model. Ringe et al. (2002),Erdem et al. (2006) and Nakhleh et al. (2005) reject dating, and avoid explicit modelling.They make a parsimony analysis without explicit measures of uncertainty. They allowsome lateral transfer of traits, and thereby generalize to graphs which are not trees. Theyemploy expert linguistic intervention in the inference, which becomes a well informed searchthrough phylogenies. In light of the random and systematic error we measure below, we donot expect estimators of tree topology related to the mode ( ie parsimony) to be adequate.However Ringe et al. (2002) add morphological traits. These may be more reliable datathan cognate traits. Such traits can be analysed in the framework we set out. Garrett(2006) shows that the breakup of dialect continua into languages is not tree-like. The localborrowing model we give in Section 9.1 generates similar model violations.

8. Inference

When we ﬁt the model of Section 2 to the Dyen et al. (1997) and Ringe et al. (2002) data, weidentify a model mis-speciﬁcation problem. For meaning classes k = 1 , , . . . , K denote by H ( k ) ( τ, i ) a trait birth-death process modelling the evolution of words in meaning category k , so that for i = 1 , , . . . , L , H ( k ) i = H ( k ) ( t i , i ) is the data at the leaves under NOABSENT.Let λ ( k ) and µ ( k ) be the birth and death rates for traits in meaning class k . It is reasonableto expect any real language to have at least one word in each of the semantic ﬁelds in theSwadesh 200-word list at all times. It follows that the birth-death process must satisfy a no-empty-ﬁeld condition, H ( k ) ( τ, i ) = ∅ or N ( τ, i ) >

0, for each ( τ, i ) ∈ [ g ].We ignore this no-empty-ﬁeld condition in our analysis. We lump together the K copiesof the birth-death process of traits corresponding to the diﬀerent meaning classes. Underthe empty-ﬁeld approximation, and assuming the death rates µ = µ ( k ) , k = 1 , , . . . , K areall equal (see Section 9.3), the superposition H ( τ, i ) = K [ k =1 H ( k ) ( τ, i ) (11)of birth-death processes generates another instance of the same process, with birth rate λ = P k λ ( k ) and death rate µ . If the mean number of words per meaning category islarge, then the process does not visit the constraint, so the approximation holds. As weshow in Section 9.5, this condition is not satisﬁed. However, when we study syntheticdata simulated under the empty-empty-ﬁeld condition, the calibration constraint forces theﬁtting procedure to adapt the approximating model to data by distorting estimates of µ ,and thereby reproduce the uncalibrated clade root ages very well.We carry out MCMC from posterior distributions p ( g, µ | D ) determined by Equation (8)and the Dyen et al. (1997) and Ringe et al. (2002) data, under the NOUNIQUE observa-tion model. We repeated the analysis with NOABSENT for the Ringe et al. (2002) data,obtaining similar results. We apply the branching process prior f G with prior 1 /θ , and theuniform root prior f R with T = 16000 (an uncontroversial upper limit on t R ). The dataoverwhelm these two priors, diﬀerences between posterior estimates obtained under the twopriors are slight, and we therefore discuss results for the prior f R in this paper and verybrieﬂy report f G -results in the supplement. Results are completely insensitive to the choiceof T , for all T suﬃciently large.In our search for conﬂicting signals in the data, we analyzed (in addition) subsets ofthe data. As discussed in Section 9.1, analyses of subsets of languages may be less exposedto error due to certain forms of borrowing. On the other hand we may uncover rate het-erogeneity between word lists or between groups of languages. We reduce the Dyen et al.(1997) data to the Swadesh 100-word list, and the Ringe et al. (2002) data to the Swadesh200- and 100-word lists. We thin the Dyen et al. (1997) data from L = 87 languages downto two sets containing L = 31 languages, and L = 30 languages (the two subsets are dis-played in the supplementary material), chosen in such a way that the pivotal calibratingdates remain applicable. These two data subsets overlap at 8 languages, but just one ofthe ﬁve calibration points has any common data (Tocharian, where there is no choice). Welabel analyses “Data-1st-author / Word List / Number of Leaves”. ncestral trees for binary trait data Figures 1 and 2 give a compact quantitative summary of the Dyen / /

87, Dyen / / / /

31, Dyen / /

30, Dyen / /

30 Ringe / /

15, Ringe / /

15 andRinge / /

17 posterior distributions. The posterior probabilities for a selection of cladesare displayed in Fig. 1, and clade labels

BGCI , GCI , CI , CG , GI , GrA , notHT and notH deﬁned. The posterior mean age for the common ancestor of the languages deﬁning eachcorresponding clade is displayed in Fig. 2. This format is useful for identifying conﬂictbetween data subsets, once clades of interest have been identiﬁed. notHT HT notH CI CG IG CIG BCIG GrA00.10.20.30.40.50.60.70.80.91 po s t e r i o r p r obab ili t y Dyen/200/87 Dyen/200/31 Dyen/200/30 Dyen/100/87 Dyen/100/31 Dyen/100/30 Ringe/430/15 Ringe/200/15 Ringe/100/17

Fig. 1.

Posterior probabilites for selected clades, across data sets. x -axis labels: BGCI, Balto-Slav-Germanic-Celtic-Italic; GCI, Germanic-Celtic-Italic; CI, Celtic-Italic; CG, Celtic-Germanic; GI,Germanic-Italic; GrA, Greek-Armenian; notHT , complement of Hittite-Tocharian; notH , complementof Hittite. In the supplement, Nicholls and Gray (2007), we deﬁne and display consensus trees, acentral point estimate for topology and branch length. The consensus tree is not a state inthe sample space of trees. Prior constraints and leaf ages are represented on sampled statesso we give, in Figures 3 and 4, samples drawn from the Dyen / /

87 and Ringe / / BCIG group is strongly supported in all analyses (except Ringe / /

17, which doesat least allow it). The

CIG group is supported in all analyses (except Dyen / /

30, whichallows it). However all the sub-clades CI , CG and IG are at odds with at least one dataset. Age estimates for the common ancestor of the languages in the CIG clade are, in allanalyses, close to the age estimates for the common ancestors of the subclades, suggestingthe breakup occurred in a relatively small interval of time, so the split structure is poorlyresolved.Referring again to the clade probabilities, Fig. 1, and the consensus trees in the sup-plement, Hittite and Tocharian form an outgroup in the three Dyen / /Y analyses, aregrouped with Greek and Armenian in the Dyen / /Y analyses, and are split in the threeRinge /X/Y analyses. There are many model mispeciﬁcation issues for Hittite and Tochar- notHT HT notH CI CG IG CIG BCIG GrA all2000400060008000100001200014000 m ean po s t e r i o r age [ y ea r s BP ] Dyen/200/87 Dyen/200/31 Dyen/200/30 Dyen/100/87 Dyen/100/31 Dyen/100/30 Ringe/430/15 Ringe/200/15 Ringe/100/17

Fig. 2.

The mean posterior ages, in years BP, for the most recent common ancestor (MRCA, super-clade root) of languages in selected combinations of clades ( ie super-clades). x -axis labels as forFig. 1. ian. Comparing notH and all in Fig. 2, Hittite adds 1000 years to the posterior mean root agein the Ringe / /

15 analysis. The contrast between the Dyen et al. (1997) and Ringe et al.(2002) analyses is most clearly visible in the notH and notHT columns of Figures 1 and 2.In contrast, conﬂicts between analyses of the Swadesh 100-word list (Dyen / / / /

31, Dyen / /

30 and Ringe / /

17, symbols + ∗ ✷ △ ) are almost absent. Both CI and IG (see Fig. 1) are allowed by these analyses. Ringe / /

17 allows the clade HT anddoes not impose HT as an outgroup (which would be a conﬂict, as notHT is not a clade ofDyen / /Y ). In other areas of conﬂict, the Dyen / /Y analyses allow a GrA clade. Thislack of conﬂict comes at the price of greater random error (compared to analyses on longerword-lists). One striking conﬂict remains: the position of Indo-Iranian relative to the rootis quite diﬀerent in the Ringe / /

17, and Dyen / /Y analyses.The posterior mean ages for the notH , notHT , all and GrA , which show particular conﬂictin Fig. 2, are in fair agreement for analyses based on the Swadesh 100-word list. Comparingthe Dyen / /

30 and Ringe / /

17 analyses, and looking at Fig. 2 and Supplement-Fig. 7,the clade root ages for notHT do not agree in either simulation. Otherwise there is agreementbetween the four analyses, on the ten measured ages, under one or both prior weightings.This reduced ( K = 100) set of traits is chosen to be resistent to borrowing. Posterior pre-dictive replicates computed in Section 9.3 show little evidence of rate heterogeneity withinthis class of traits. The corresponding words are relatively well attested in otherwise incom-pletely reconstructed ancient languages, so there is little missing data. In Section 9.2 wecompute posterior predictive distributions for singleton traits in the Ringe / /

17 analysis;these agree well with external data.In summary, the systematic errors displayed in our four age estimates from the Swadesh100-word list are representative. On the other hand, most features of tree topology whichwere in doubt, remain in doubt. ncestral trees for binary trait data KashmiriGypsy_GkSinghaleseMarathiGujaratiHindiPanjabi_STLahndaBengaliNepali_ListKhaskuraOsseticBaluchiWakhiPersian_ListTadzikAfghanWaziriAlbanian_CAlbanian_KAlbanian_GAlbanian_TopAlbanian_THITTITE TOCHARIAN_ATOCHARIAN_BArmenian_ListArmenian_ModGreek_KGreek_ModGreek_MLGreek_MDGreek_DRiksmalDanishFaroeseIcelandic_STSwedish_UpSwedish_VLSwedish_ListGerman_STPenn_DutchAfrikaansDutch_ListFlemishFrisianEnglish_STTakitakiSardinian_CSardinian_NSardinian_LRomanian_ListVlachLadinItalianCatalanProvencalFrench_Creole_CFrench_Creole_DWalloonFrenchPortuguese_STBrazilianSpanishWelsh_CWelsh_NBreton_SEBreton_STBreton_ListIrish_AIrish_BLatvianLithuanian_OLithuanian_STUkrainianByelorussianPolishRussianLusatian_LLusatian_USlovakCzech_ECzechSlovenianSerbocroatianBulgarianMacedonian0 2000 4000 6000 8000 10000

Fig. 3.

Tree sampled from the

Dyen / / posterior distribution. x -axis gives age in years. Priorconstraints on eight clade root and three leaf ages are indicated by horizontal bars. In order to reduceclutter, a single bar shows the two Tocharian A and B constraints, which are near equal.6 hittite tocharian_blatin oldhighgermanoldenglisholdnorseoldirish welshlithuanianlatvianoldcslavonicvedic greek armenian albanian0 10002000300040005000600070008000 Fig. 4.

A sampled state illustrating the

Ringe / / posterior distribution. Prior constraints ontopology impose 7 clades. Prior uncertainties in clade root and leaf ages are indicated by horizontalbars.

9. Model mis-speciﬁcation

We head this section with a summary of its results. These results coincide with the conclu-sions we draw from the between-data analyses in Section 8.2: our age estimates are robust;tree topology less so. In Fig. 5 and Fig. 6 we present results from synthetic data, simulatedon a tree sampled from the posterior distribution of the

Ringe/ /

15 analysis (true cladestructure is marked in Fig. 5 with 0 below false clades and 1 below true; the true tree isin Nicholls and Gray (2007)), under a range of observation models intended to mimic likelymodel mis-speciﬁcation. Details of these models, which simulate the empty-ﬁeld-condition,plausible levels of borrowing and branch-wise and trait-wise rate heterogeneity, are given inSections 9.1 through 9.7.Clades imposed in reconstructions from synthetic data are indexed s- and are the sameas the clades deﬁned below Fig. 1 and displayed as horizontal bars in Fig. 4. Analyses ofsynthetic data are indexed “S/X/Y”. Values of X indicate borrowing and rate heterogeneity:X=“T” is no borrowing; X=“G b ” is global borrowing at rate bµ ; X=“L z − b ” is localborrowing, between languages with a common ancestor not more than z years in the past,at rate bµ ; and X=“BH ρ ” and X=“MH ρ ” have rates drawn independently, for each branch(BH) and meaning category (MH) respectively, from a Gamma distribution with mean µ and standard deviation ( ρ/ µ . Values of Y show the constraint applied: Y=“U n ” is theunconstrained birth death process of set elements, with n = λ/µ the expected number of ncestral trees for binary trait data distinct traits at each leaf, under the NOABSENT observation model; Y=“C n ” simulatescognate classes under the no-empty-ﬁeld constraint, using n meaning categories. s−notHT:1 s−HT:0 s−notH:1 s−CI:0 s−CG:0 s−IG:1 s−CIG:1 s−BCIG:1 s−GrA:100.10.20.30.40.50.60.70.80.91 clade po s t e r i o r s uppo r t Clade support across synthetic data sets

S/T/U200S/G0.1/U200S/G0.2/U200S/G0.5/U200S/L500−1/U200S/T/C200S/L500−1/C200S/BH10/U200S/BH33/U200S/BH50/U200S/MH25/U200S/MH50/U200S/MH100/U200

Fig. 5.

Synthetic data yields estimates of posterior probabilities for selected clades, across syn-thetic data sets. x -axis labels as for Fig. 1 with s- preﬁx indicating synthetic and 0/1 indicatingabsence/presence of the clade in the true tree. s−notHT s−HT s−notH s−CI s−CG s−IG s−CIG s−BCIG s−GrA s−all20003000400050006000700080009000100001100012000 clades m ean po s t e r i o r age [ y ea r s BP ] Clade ages across synthetic data sets

S/T/U200S/G0.1/U200S/G0.2/U200S/G0.5/U200S/L500−1/U200S/T/C200S/L500−1/C200S/BH10/U200S/BH33/U200S/BH50/U200S/MH25/U200S/MH50/U200S/MH100/U200true age

Fig. 6.

Synthetic data yields estimates of mean posterior ages, in years BP, for the most recentcommon ancestor (MRCA, super-clade root) as in Fig. 1 with s- preﬁx indicating synthetic.

Systematic errors in tree node ages inferred from synthetic data generated under thesemodels are in general small. With the exception of

S/BH /U

200 (see Section 9.4), thesystematic error we generated in Fig. 6 is of the same order of magnitude as, or smallerthan, random error. Systematic error in estimated rates µ (not shown) is highly signiﬁcant.Calibration data ﬁxes dates and topology for that part of the tree adjacent to the leaves,forcing the inference to accommodate model mis-speciﬁcation by adjusting rates. The mod-iﬁed rates ﬁt the imposed trait evolution adjacent to the leaves. If model mis-speciﬁcationis homogeneous over the tree, as is the case for the empty-ﬁeld-approximation, trait evolu-tion deep in the tree may be well represented by these biased rates, and date estimates are accordingly robust.Tree topology is not robust to the model mis-speciﬁcation we explored. The “true”s-clades s-GrA , s-BCIG , and the rooting clades s-notHT , and s-notH have robust support atlevels which do not allow rejection of the truth. The “true” s-clades s-CIG and s-IG are notwell reconstructed when borrowing is substantial. The branching at the top of the superclade s-BCIG is poorly resolved as the s-BCIG , s-CIG and s-IG branches are separated by just1000 years in the tree on which the synthetic data was simulated (see Nicholls and Gray(2007)), which is small compared to µ − ≃ b = 0 . , . Word borrowing from languages outside the study is straightforward trait birth (unless thesame word is borrowed into several languages). If we delete a source language from ourstudy, we thereby remove the model error associated with borrowing from that language.The consistency we see in Fig. 2 between clade ages reconstructed for near-disjoint subsets oflanguages, and the full set, in the Dyen / /

87, Dyen / /

31 and Dyen / /

30 posteriors,suggests that borrowing is not distorting the Dyen / /

87 estimates themselves.Our models of borrowing are as follows. We associate with each time slice τ ∈ [0 , ∞ )across the tree a linkage graph ( E ( τ ) , V ( τ )) with nodes, V ( τ ) = { ( τ, i ); ( τ, i ) ∈ [ g ] } , corre-sponding to points in [ g ] intersected by the slice. The linkage graph models traﬃc betweenlanguages; its edges h y, z i ∈ E ( τ ) connect sets H ( y ) and H ( z ) between which trait instancescan pass. Let ˜ V ( z ) = { z ∈ V ( τ ) : ∃ y ∈ V ( τ ) , h y, z i ∈ E ( τ ) } be the set of nodes adjacent to z ∈ V ( τ ). Let b denote the relative rate of word-borrowing to word-death. At per capitarate bµ each instance of each trait in each language in the time slice τ generates a borrowingevent. Suppose the selected trait-instance is in language z ∈ V ( τ ) and is labeled c ∈ C . Alanguage y ∈ ˜ V ( z ) is chosen, uniformly at random from nodes adjacent to z on ( E ( τ ) , V ( τ )),and we set H ( y ) ← H ( y ) ∪ { c } , ie the word is copied into the target language.We model local borrowing as follows. Words transfer between languages which have asuﬃciently recent common ancestor. The linkage graph at time t includes an edge from( t, i ) to ( t, j ) if points ( t, i ) and ( t, j ) in [ g ] have a common ancestor less than z years inthe past. In this model linked groups of languages break up into linked subgroups. In ourmodel of widespread borrowing (the “global” borrowing model), all languages communicateequally with all other languages, and the linkage graph is the complete graph.Our exploration of these models is summarized in Figures 5 and 6 by the three S/G b /U200data sets and the S/L500 − M a ), the number of languages displaying cognate a , tails oﬀrapidly, so that few cognates are displayed in many languages. At b ≃

1, cognates simplysurvive too well, and many cognates from deep in the tree survive into many languages.Local borrowing has time depth z = 500 and a borrowing rate equal to the death rate. Wesee from Fig. 6 that age estimates are robust to this form of model mis-speciﬁcation. ncestral trees for binary trait data Where the observation model is NOABSENT, singleton traits are present, and we canuse them to test the model. We drop them from the data, carry out the inference underNOUNIQUE, and then see if we can predict the number of singleton traits for each taxon.This check was available for the Ringe et al. (2002) data. We expect rate heterogeneity andborrowing to be visible (but probably not distinguishable) in these tests.Denote by ˜ D synthetic trait data generated under the NOABSENT observation model,displaying ˜ N distinct traits. For trait a = 1 , , . . . , ˜ N let ˜ M a give the indices of leavesdisplaying an instance of trait a for predicted data ˜ D , and ˜ X i be the number of singletontraits in ˜ D at taxon i ,˜ X i = card { ˜ M a : ˜ M a = { i } , a = 1 , , . . . , ˜ N } i ∈ V L . The posterior predictive distribution Pr { ˜ D | D } isPr { ˜ D | D } = Z Pr { ˜ D | g, µ, λ } p ( g, µ, λ | D ) dgdµdλ and this determines a predictive distribution for ˜ X i . We sample µ, λ and g from the poste-rior p ( g, µ, λ | D ) ( g and µ are available from MCMC output; we restore λ by sampling itsposterior conditional density), simulate synthetic data ˜ D at the leaves of g , and compute˜ X i from ˜ D . Let X i ( D ) = card { M a : M a = { i } , a = 1 , , . . . , N } , denote the number ofsingleton traits at taxon i in the original real data itself.Predictive distributions for X i are given in Supplement-Fig. 9. The predictive distribu-tions for ˜ X i over-estimate the X i in the Ringe et al. (2002) data with K = 328 meaningcategories. Since borrowing depletes singleton traits, this is consistent with model mis-speciﬁcation due to borrowing. Also, we expect borrowing to be weaker on the shorterword-lists ( K = 100 , K = 100). Rate heterogeneity can mimic this behavior. Correspond-ing studies for synthetic data are given in Supplement-Fig. 10,11. Predictive distributionsfrom the shorter word-lists are in good agreement with the data. Pagel and Meade (2006) show that the evolution rates of words are, for a given meaning cat-egory, fairly consistent across data sets, whilst varying more substantially between meaningcategories. Fig. 7 displays a tendency for the shorter word-lists to evolve at relatively slowerrates. This is expected. However the rate variation between data sets in Fig. 7 does notlead directly to variation in estimated root times in Fig. 2. For example, Dyen / /

87 andDyen / /

87 diﬀer by a factor 1.5 in posterior mean rate, but by just 1.2 in root age. Timedepth measurements do not depend on an assumption of constant rates between analyses,since rates are estimated from calibration points in the recent history of the same data usedto predict branching times.In order to generate synthetic data with rate heterogeneity across meaning classes (the

S/M Hρ/U

200 simulations), we draw rates µ ( k ) ∼ Gamma( α, β ) independently for each meaning category k = 1 , , . . . , K , with mean αβ = µ and variance αβ = ( rµ ) , where r = 0 . , . r = 1, simulate a trait process H ( k ) ( τ, i ) at rate λ, µ ( k ) , merge meaning categories, as Equation (11), and then read oﬀ data, as Equation (2).Pagel and Meade (2006) estimate that the rate variance over meaning classes is ( rµ ) ≃ µ / M a , oflanguages in which trait a ∈ C appears. Denote by Y ( n ) = Y ( n ) ( D ) the number of traitsdisplayed at n leaves, Y ( n ) ( D ) = card { M a : card M a = n, a = 1 , , . . . , N } , and let ˜ Y ( n ) = Y ( n ) ( ˜ D ) be the corresponding random variable computed from posteriorpredictive data ˜ D ∼ Pr { ˜ D | D } . We plot E ( ˜ Y ( n ) | D ) − Y ( n ) and the envelope ± Y ( n ) | D ).In the supplementary material (Supplement-Fig. 12) we show that the ﬁtting procedure isunable to reproduce the trait frequency distribution in synthetic data with high levels ofrate heterogeneity across meaning classes (standard deviation 50% of the mean) but lowerlevels (25%) are invisible.Returning to the real data, in Supplement-Fig. 13 (left), some inconsistency attributableto rate heterogeneity between traits is visible in our Ringe / /

15 analysis. Among otherproblems, the data contains an excess of traits appearing in 10 or more leaves. This iscaused by a small cohort of traits evolving at death rate µ small compared to the rest.The eﬀect is very greatly reduced in the Ringe / /

17 analysis (Supplement-Fig. 13, right).Analyses of the Dyen et al. (1997) data show a similar pattern.

Time depth measurements depend on some assumption about the way rates have changedover time within each data set we analyze. The variations in rates between clades withineach of the D100, D200, R328 and R100 groups in Fig. 7 give us an indication of the un-modelled rate variation we can expect in the deeper branches of the tree. We give theper-trait-instance death rate µ for each clade calibration constraint independently. Wesampled the posterior distributions p ( g, µ | D clade ) determined by the data for each clade inturn. We could use the posterior rate distribution from one calibration clade as a prior topredict the age range for the root of another calibration clade. Where conﬁdence intervalsfor the reconstructed rates of a given data set overlap, the corresponding predictions will begood. Such prediction is legitimate within one data set only, so we compare rates betweenvertical dashed lines.In the supplement for this section we report and discuss a similar cross-validation exerciseon the Dyen / /

87 analysis. We drop each calibration-clade in turn (both topology and ageconstraint) and estimate the clade root age using all the word-list data and the remainingcalibration points. Of 10 such tests, 8 succeed. The predicted age range for Balto-Slav isslightly too deep, and that of Hittite very signiﬁcantly too young.Synthetic data with spatio-temporal rate variation (

S/BHρ/U

200 analyses), have rates µ h i,j i ∼ Gamma( α, β )drawn independently on each edge h i, j i ∈ E , with mean αβ = µ and variance αβ = ( rµ ) ,where r = ρ/ ncestral trees for binary trait data −4 µ Y ea r s − D − D − R − R − R − D − It a l D − I b F r D − G e r m D − B a S l D − B r y t D − It a l D − I b F r D − G e r m D − B a S l D − B r y t R − N W G e r m R − W G e r m * R − B a S l R − B a l t * R − N W G e r m R − W G e r m * R − B a S l R − B a l t * R − C e l t * R − C e l t * Fig. 7.

Posterior mean values of µ , with standard errors at two standard deviations, measured in anal-yses Dyen / / (D200-87), Dyen / / (D100-87), Ringe / / (R328-15), Ringe / / (R200-15) and Dyen / / (R100-17), and independently using calibration points in distinct clades.The observation model is NOUNIQUE, except for two-leaf clades marked *, where NOABSENT mustbe used. rate. Lees (1953) sees 20% variation in rate estimates from pairs of languages. Results arerobust to moderate levels of unstructured, random rate variation of this kind.Results are of course not robust to structured rate variation, in which, for example,rates on edges at ages greater than any calibration point are all larger than any rates inthe calibration zone, or a single-taxon outgroup has an extreme rate. In S/BH /U s − hittite happens to have a high rate. Its root is pushed to great age, and with it goesthe root of the tree. The analysis is exposed to rare “catastrophic” trait-evolution eventsoutside the calibration zone. We checked that that no single language or small outgroup isdetermining the root age in the Ringe / /

17, and Dyen / /Y analyses. Agreement be-tween reconstructions based on the predominantly ancient languages of Ringe et al. (2002)and modern languages of Dyen et al. (1997) shows that there is at least no such structuredrate variation in the recent past. Our empty-ﬁeld approximation will be good if there is signiﬁcant “polymorphism”, that is,if the mean number λ ( k ) /µ ( k ) of traits ( ie , words per meaning category) in the H ( k ) ( τ, i )-process is large. We estimate λ/µ at 273(9) for the Dyen et al. (1997) Swadesh-200 dataand 280(25) for the Ringe et al. (2002) Swadesh-200 data (posterior standard deviation inparenthesis) and hence λ ( k ) /µ ( k ) ≃ .

4. The probability, exp( − λ ( k ) /µ ( k ) ) ≃ /

4, to ﬁndthe unconstrained trait-set process H ( k ) ( τ, i ) in the empty set at any single ﬁxed point( τ, i ) ∈ [ g ] is high enough to cause concern.We simulate synthetic data from the trait birth-death process constrained to respectthe no-empty-ﬁeld condition. For each of the k = 1 , , . . . , K meaning classes, we simulate N ( k ) ( t R , R ) from a Poisson distribution constrained to be greater than zero, then simulate H ( k ) ( τ, i ) | N ( k ) ( τ, i ) > g ]. The total rate for the exponential waiting time to thenext event does not include µ if N ( k ) ( τ, i ) = 1. We then merge the meaning classes as in Equation (11). Our studies are represented here by two simulations, S/T/C200 andS/L500-1/C200, the latter including local borrowing. The per-capita death rate µ was setto a large value, so that polymorhpism was low. We ﬁnd, when we ﬁt data of this kind,that the tree and its dates are robust to this form of model mis-speciﬁcation. When the scientist groups instances of traits into homology classes, instances of traits borndeep in the tree may be highly evolved, and correspondingly diﬃcult to identify as in facthomologous. This error can populate the deeper branches of the tree with spurious birthevents. This is a case where model mis-speciﬁcation is not homogeneous over the tree, andwill lead to over-estimation of the tree depth. When we replace the Ringe et al. (2002)“splitting” data with the Ringe et al. (2002) “lumping” data we do see a 3% downwardshift in the estimated root time.

In our analysis of the Ringe et al. (2002) data, we retain some languages with gaps, corre-sponding to missing data. We replace these gaps with zeros, marking trait absence. Gappylanguages (Hittite, Tocharian) do stand out in predictive tests counting singleton traits onexternal data for the 328 and 200 word-lists. However, the eﬀect is removed when we reducethe data to the Swadesh 100 word-list, where traits are better attested. The eﬀect is tobias reconstructed branching times for gappy taxa to larger age values on the Ringe et al.(2002) 328 and 200 word-list data (see for example HT in Fig. 2). Acknowledgements

The authors acknowledge advice and assistance from Quentin Atkinson and David Welch ofthe University of Auckland, and ﬁnancial support from the Royal Society of New Zealand.

References

Atkinson, Q., G. Nicholls, D. Welch, and R. Gray (2005). From words to dates: water intowine, mathemagic or phylogenetic inference.

Transactions of the Philological Society 103 ,193–219.Bergsland, K. and H. Vogt (1962). On the validity of glottochronology.

Current Anthropol-ogy 3 , 115–153.Blust, R. (2000). Why lexicostatistics doesn’t work: the “universal constant” hypothesisand the austronesian languages. In C. Renfrew, A. McMahon, and L. Trask (Eds.),

Timedepth in historical linguistics , pp. 311–332. Cambridge, UK: The McDonald Institute forArchaeological Research.Drummond, A. J., G. K. Nicholls, A. G. Rodrigo, and W. Solomon (2002). Estimatingmutation parameters, population history and genealogy simultaneously from temporallyspaced sequnce data.

Genetics 161 , 1307–1320.Dyen, I., J. Kruskal, and B. Black (1992). An Indoeuropean classiﬁcation: a lexicostatisticalexperiment.

Transactions of the American Philosophical Society 82 (1-132). ncestral trees for binary trait data Dyen, I., J. Kruskal, and P. Black (1997). Electronic format from website of 3rd author,

Embleton, S. (1986).

Statistics in Historical Linguistics . Bochum: Brockmeyer.Erdem, E., V. Lifschitz, and D. Ringe (2006). Temporal phylogenetic networks and logicprogramming.

Theory and Practice of Logic Programming 6 , 539–558.Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihoodapproach.

J. Mol. Evol. 17 , 368–376.Felsenstein, J. (1992). Phylogenies from restriction sites: a maximum-likelihood approach.

Evolution 46 , 159–173.Garrett, A. (2006). Convergence in the formation of Indo-European subgroups: phylogenyand chronology. In P. Forster and C. Renfrew (Eds.),

Phylogenetic Methods and thePrehistory of Languages , pp. 139–152. Cambride, UK: The McDonald Institute for Ar-chaeological Research.Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with discussion).

Statist. Sci. 7 ,473–511.Gray, R. and Q. Atkinson (2003). Language-tree divergence times support the Anatoliantheory of Indo-European origin.

Nature 426 , 435–439.Huson, D. and M. Steel (2004). Phylogenetic trees based on gene content.

Bioinformat-ics 20 (13), 2044–2049.Kimura, M. and J. Crow (1964). The number of alleles that can be maintained in a ﬁnitepopulation.

Genetics 49 , 725–738.Lees, R. (1953). On the basis of glottochronology.

Language 29 , 113–127.Lewis, P. (2001). A likelihood approach to estimating phylogeny from discrete morphologicalcharacter data.

Systematic Biology 50 , 913–925.Nakhleh, L., D. Ringe, and T. Warnow (2005). Perfect phylogenetic networks: A newmethodology for reconstructing the evolutionary history of natural languages.

LAN-GUAGE, Journal of the Linguistic Society of America 81 , 382–420.Nicholls, G. and R. Gray (2007). Dated ancestral trees from binary trait dataand its application to the diversiﬁcation of languages (supplementary material).

Nylander, J., F. Ronquist, J. Huelsenbeck, and J. Nieves-Aldrey (2004). Bayesian phyloge-netic analysis of combined data.

Systematic Biology , 47–67.Pagel, M. and A. Meade (2006). Estimating rates of lexical replacement on phylogenetictrees of languages. In P. Forster and C. Renfrew (Eds.),

Phylogenetic Methods and thePrehistory of Languages , pp. 173–181. Cambride, UK: The McDonald Institute for Ar-chaeological Research. Ringe, D., T. Warnow, and A. Taylor (2002). Indo-European and Computational Cladis-tics.

Transactions of the Philological Society 100 , 59–129. Data available from

Ronquist, F. and J. P. Huelsenbeck (2003). Mrbayes 3: Bayesian phylogenetic inferenceunder mixed models.

Bioinformatics 19 , 1572–1574.Saitou, N. and M. Nei (1987). The neighbor-joining method: a new method for reconstruct-ing phylogenetic trees.

Molecular Biology and Evolution 4 , 406–425.Sankoﬀ, D. (1973). Mathematical developments in lexicostatistic theory.

Current Trendsin Linguistics 11 , 93–113.Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts.

Proc. Am. Phil.Soc. 96 , 453–463.Warnow, T., S. Evans, D. Ringe, and L. Nakhleh (2006). A stochastic model of languageevolution that incorporates homoplasy and borrowing. In P. Forster and C. Renfrew(Eds.),

Phylogenetic Methods and the Prehistory of Languages , pp. 75–87. Cambride,UK: The McDonald Institute for Archaeological Research.Watterson, G. (1975). On the number of segregating sites in genetical models withoutrecombination.

Theoretical Population Biology 7 , 256–276.Yang, Z. and B. Rannala (1997). Bayesian phylogenetic inference using DNA sequences: aMarkov Chain Monte Carlo method.

Molecular Biology and Evolution 14 , 717–724.Yang, Z. and B. Rannala (2006). Bayesian estimation of species divergence times undera molecular clock using multiple fossil calibrations with soft bounds.