The Annals of Applied Statistics
Institute of Mathematical Statistics, 2008
ANALYSIS OF COMPARATIVE DATA WITH HIERARCHICAL AUTOCORRELATION
By Cécile Ané
University of Wisconsin—Madison
The asymptotic behavior of estimates and information criteria in linear models is studied in the context of hierarchically correlated sampling units. The work is motivated by biological data collected on species, where autocorrelation is based on the species' genealogical tree. Hierarchical autocorrelation is also found in many other kinds of data, such as from microarray experiments or human languages. Similar correlation also arises in ANOVA models with nested effects. I show that the best linear unbiased estimators are almost surely convergent but may not be consistent for some parameters, such as the intercept and lineage effects, in the context of Brownian motion evolution on the genealogical tree. For the purpose of model selection I show that the usual BIC does not provide an appropriate approximation to the posterior probability of a model. To correct for this, an effective sample size is introduced for parameters that are inconsistently estimated. For biological studies, this work implies that tree-aware sampling design is desirable; adding more sampling units may not help ancestral reconstruction, and only strong lineage effects may be detected with high power.
1. Introduction.
In many ecological or evolutionary studies, scientists collect “comparative” data across biological species. It has long been recognized [Felsenstein (1985)] that sampling units cannot be considered independent in this setting. The reason is that closely related species are expected to have similar characteristics, while a greater variability is expected among distantly related species. “Comparative methods” accounting for ancestry relationships were first developed and published in evolutionary biology journals [Harvey and Pagel (1991)], and are now being used in various other fields. Indeed, hierarchical dependence structures of inherited traits arise in many areas, such as when sampling units are genes in a gene family
Received December 2007; revised April 2008.
Key words and phrases. Asymptotic convergence, consistency, linear model, dependence, comparative method, phylogenetic tree, Brownian motion evolution.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2008, Vol. 2, No. 3, 1078–1102. This reprint differs from the original in pagination and typographic detail.
Fig. 1.
Example of a genealogical tree from 4 units (left) and covariance matrix of vector Y under the Brownian motion model (right).

[Gu (2004)], HIV virus samples [Bhattacharya et al. (2007)], human cultures [Mace and Holden (2005)] or languages [Pagel, Atkinson and Meade (2007)]. Such tree-structured units show strong correlation, in some way similar to the correlation encountered in spatial statistics. Under the spatial “infill” asymptotic, where a region of space is filled in with densely sampled points, it is known that some parameters are not consistently estimated [Zhang and Zimmerman (2005)]. It is shown here that inconsistency is also the fate of some parameters under hierarchical dependency. While spatial statistics is now a well recognized field, the statistical analysis of tree-structured data has been mostly developed by biologists so far. This paper deals with a classical regression framework used to analyze data from hierarchically related sampling units [Martins and Hansen (1997), Housworth, Martins and Lynch (2004), Garland, Bennett and Rezende (2005), Rohlf (2006)].

Hierarchical autocorrelation.
Although species or genes in a gene family do not form an independent sample, their dependence structure derives from their shared ancestry. The genealogical relationships among the units of interest are given by a tree (e.g., Figure 1) whose branch lengths represent some measure of evolutionary time, most often chronological time. The root of the tree represents a common ancestor to all units considered in the sample. Methods for inferring this tree typically use abundant molecular data and are now extensively developed [Felsenstein (2004), Semple and Steel (2003)]. In this paper the genealogical tree relating the sampled units is assumed to be known without error.

The Brownian motion model (BM) of evolution states that characters evolve on the tree with a Brownian motion (Figure 2). After time t of evolution, the character is normally distributed, centered at the ancestral value at time 0 and with variance proportional to t. Each internal node in the tree depicts a speciation event: an ancestral lineage splitting into two new lineages. The descendant lineages inherit the ancestral state just prior to speciation. Each lineage then evolves with an independent Brownian motion. The covariance matrix of the data at the n tips Y = (Y_1, ..., Y_n) is then determined by the tree and its branch lengths:

Y ∼ N(µ1, σ² V_tree),

where µ is the character value at the root of the tree and 1 is the vector of ones. Components of V_tree are the times of shared ancestry between tips, that is, V_ij is the length shared by the paths from the root to tips i and j (Figure 1). The same structural covariance matrix could actually be obtained under other models of evolution, such as Brownian motion with drift, evolution by Gaussian jumps at random times or stabilizing selection in a random environment [Hansen and Martins (1996)]. The i.i.d. model is obtained with a “star” tree, where all tips are directly connected to the root by edges of identical lengths.
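This covariance construction can be sketched numerically. The 4-tip tree below is hypothetical (the figure's branch lengths are not given in the text): an ultrametric tree of height T = 1 with two cherries diverging at times 0.7 and 0.4, so V_ij is the shared root-to-tip path length.

```python
import numpy as np

# Hypothetical 4-tip ultrametric tree of height T = 1:
# V[i, j] = length of the root-to-tip path shared by tips i and j.
T = 1.0
V = np.array([
    [T,   0.7, 0.0, 0.0],
    [0.7, T,   0.0, 0.0],
    [0.0, 0.0, T,   0.4],
    [0.0, 0.0, 0.4, T  ],
])

# Simulate Y ~ N(mu * 1, sigma^2 * V) using a Cholesky factor of V.
rng = np.random.default_rng(0)
mu, sigma = 10.0, 1.0
L = np.linalg.cholesky(V)
Y = mu + sigma * (L @ rng.standard_normal(4))

# A "star" tree (every tip tied directly to the root by edges of equal
# length) gives V = T * I, i.e., the i.i.d. model described in the text.
V_star = T * np.eye(4)
assert np.allclose(V_star, np.diag(np.diag(V_star)))
```

Any valid BM covariance built this way is symmetric positive definite, which is what makes the Cholesky-based simulation possible.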
Another model of evolution assumes an Ornstein–Uhlenbeck (OU) process and accounts for stabilizing selection [Hansen (1997)]. The present paper covers the assumption of a BM structure of dependence, although several results also apply to OU and other models. As the Brownian motion is reversible, the tree can be re-rooted. When the root is moved to a new node in the tree, the ancestral state µ represents the state of the character at that new node, so re-rooting the tree corresponds to a re-parametrization.

The linear model.
A frequent goal is to detect relationships between two or more characters, or to estimate ancestral traits [Schluter et al. (1997), Pagel (1999), Garland and Ives (2000), Huelsenbeck and Bollback (2001), Blomberg, Garland and Ives (2003), Pagel, Meade and Barker (2004)]. In this paper I consider the linear model Y = Xβ + ε with ε ∼ N(0, σ² V_tree), as derived from a BM evolution on the tree. When the matrix of predictors X is of full rank k, it is well known that the best linear unbiased estimator for β is

β̂ = (X^t V^{-1} X)^{-1} X^t V^{-1} Y.

Fig. 2.
Simulation of BM evolution along the tree in Figure 1. Ancestral state was µ = 10. Observed values of Y are marked by points.
Random covariates are typically assumed to evolve with a BM on the same tree as Y. Fixed covariates are also frequently considered, such as determined by a subgroup of tips. Although this model has already been used extensively, the present paper is the first one to address its asymptotic properties. For a meaningful asymptotic framework, it is assumed that the root of the tree is fixed while units are added to the sample. The reason is that the intercept relates to the ancestral state at the root of the tree. If the root is pushed back in time as tips are added to the tree, then the meaning of the intercept changes and there is no hope of consistency for the intercept. The assumption of a fixed root is just a rooting requirement. It does not prevent any unit from being sampled.

Asymptotic results assume the sample size goes to infinity. I argue here that this is relevant in real biological studies. For instance, studies on phylogenetically related viral samples have included hundreds of samples [Bhattacharya et al. (2007)]. Pagel, Atkinson and Meade (2007) have built and used a tree relating as many as 87 Indo-European languages. Many groups count an incredibly large number of species. For instance, there are about 20,000 orchid species to choose from [Dressler (1993)], over 10,000 species of birds [Jønsson and Fjeldså (2006)], or about 200 wild potato species [Spooner and Hijmans (2001)]. In addition, studies can consider sub-populations and even individuals within species, so long as they are related by a divergent tree.

Organization.
The main results are illustrated on real examples in Section 2. Section 3 shows that β̂ is convergent almost surely and in L² norm. In Section 4, I then show that some components of β̂ are not consistent, converging instead to some random value. This is typically the case for the intercept and for lineage effect estimators, while estimates of random covariate effects are consistent. In Section 5 I investigate a sampling strategy, unrealistic for most biological settings, under which consistency can be achieved for the intercept. With this sampling strategy, I show a phase transition for the rate of convergence: if branches are not sampled close to the root of the tree fast enough, the rate of convergence is slower than the usual √n rate. In Section 6 I derive an appropriate formula for the Bayesian Information Criterion and introduce the concept of effective sample size. Applications to biological problems are discussed in Section 7, as well as applications to a broader context of hierarchical models such as ANOVA.
2. Illustration of the main results.
Davis et al. (2007) analyzed flower size diameter from n = 25 species. Based on the plants' tree (Figure 3, left), assuming a simple BM model with no shift, calculations yield an effective sample size n_e = 5.54 for the purpose of estimating the flower diameter of the ancestral species at the root. This is about a 4-fold decrease compared to the number of 25 species, resulting in a confidence interval over 2 times wider than otherwise expected from n = 25 i.i.d. sampling units. The analysis of a larger tree with 49 species [Garland et al. (1993)] shows an 8-fold decrease with n_e = 6.11. Section 4 shows this is a general phenomenon: increasing the sample size n cannot push the effective sample size n_e associated with the estimation of ancestral states beyond some upper bound. More specifically, Section 4 shows that n_e ≤ kT/t₁, where k is the number of edges stemming from the root, t₁ is the length of the shortest of these edges and T is the distance from the root to the tips (or its average value). To account for autocorrelation, Paradis and Claude (2002) introduced a degree of freedom df_P = L/T, where L is the sum of all branch lengths. Interestingly, n_e is necessarily smaller than df_P when all tips of the tree are at equal distance T from the root (see Appendix A).

Unexpectedly large confidence intervals are already part of biologists' experience [Schluter et al. (1997)]. As Cunningham, Omland and Oakley (1998) put it, likelihood methods have “revealed a surprising amount of uncertainty in ancestral reconstructions,” to the point that authors may be tempted to prefer methods that do not report confidence intervals [McArdle and Rodrigo (1994)] or to ignore autocorrelation due to shared ancestry [Martins (2000)]. Still, reconstructing ancestral states or detecting unusual shifts between two ancestors are very frequent goals. For example, Hansen (1997) hypothesized a shift in tooth size to have occurred along the ancient lineage separating browsing horses and grazing horses. Recent micro-array data from gene families have been used to infer ancestral expression patterns, as well as shifts that possibly occurred after genes were duplicated [Gu (2004)]. Guo et al. (2007) have estimated shifts in brain growth along the human lineage and along the lineage ancestral to human/chimp. Sections 3 and 4 show that under the BM model ancestral reconstructions and shift estimates are not consistent, but instead converge toward a random limit. This is illustrated by small effective sample sizes associated with shift estimators. Among the 25 plant species sampled by Davis et al. (2007), 3 parasitic Rafflesiaceae species have gigantic flowers (in bold in Figure 3). Under a BM model with a shift on the Rafflesiaceae lineage, the effective sample sizes for the root's state (n_e = 3.98) and for the shift (n_e = 2.72) are obtained from the Rafflesiaceae subtree and the remaining subtree. These low effective sample sizes suggest that only large shifts can be detected with high power.

The potential lack of power calls for optimal sampling designs. Trees are typically built from abundant and relatively cheap molecular sequence data. More and more often, a tree comprising many tips is available, while traits of interest cannot be collected from all tips on the tree. A choice has to be made on which tips should be kept for further data collection. Until recently, investigators did not have the tree at hand to make this choice,
Fig. 3.
Phylogenetic trees from Davis et al. (2007) with 25 plant species, n_e = 5.54 (left), and from Garland et al. (1993) with 49 mammal species, n_e = 6.11 (right). Bottom: effective sample size n_e for sub-samples of a given size. Vertical bars indicate 95% confidence intervals and median n_e values when tips are selected at random from the plant tree (left) and mammal tree (right). Dots indicate optimal n_e values.

but now most investigators do. Therefore, optimal sampling design should use information from the tree. Figure 3 shows the effective sample size n_e associated with the root's state in the simple BM model. First, sub-samples were formed by randomly selecting tips, and n_e was calculated for each sub-sample. Since there can be a huge number of combinations of tips, 1000 random sub-samples of size k were generated for each k. Medians and 95% confidence intervals for n_e values are indicated by vertical bars in Figure 3. Second, the sub-samples of a size k that maximize the effective sample size n_e were obtained using step-wise backward and forward searches. Both search strategies agreed on the same maximal n_e values, which are indicated with dots in Figure 3. From both trees, only 15 tips suffice to obtain a near maximum effective sample size, provided that the selected tips are well chosen, not randomly. The proposed selection of tips maximizes n_e and is based on the phylogeny only, prior to data collection. In view of the bound for n_e mentioned above, the selected tips will tend to retain the k edges stemming from the root and to minimize the length of these edges by retaining as many of the early branching lineages as possible.

For the purpose of model selection, BIC is widely used [Schwarz (1978), Kass and Raftery (1995), Butler and King (2004)] and is usually defined as −2 log L(β̂, σ̂²) + p log(n), where L(β̂, σ̂²) is the maximized likelihood of the model, p the number of parameters and n the number of observations.
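As a small numeric sketch, the usual per-parameter penalty log(n) can be compared with the bounded penalty log(1 + n_e) discussed next, using the effective sample size quoted above for the plant tree (n = 25, n_e = 5.54 for the root state).

```python
import math

# Compare the classical BIC penalty log(n) with the bounded penalty
# log(1 + n_e) proposed for inconsistently estimated parameters.
# Values are those quoted in the text for the 25-species plant tree.
n = 25
n_e_root = 5.54                         # effective sample size, root state
penalty_usual = math.log(n)             # ~3.22, classical per-parameter penalty
penalty_root = math.log(1 + n_e_root)   # ~1.88, bounded penalty for the intercept

assert penalty_root < penalty_usual
```

Since n_e is bounded above regardless of n, the corrected penalty stays bounded even as more tips are sampled, unlike log(n).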
Each parameter in the model is thus penalized by a log(n) term. Section 6 shows that this formula does not provide an approximation to the model posterior probability. Instead, the penalty associated with the intercept and with a shift should be bounded, and log(1 + n_e) is an appropriate penalty to be used for each inconsistent parameter. On the plant tree, the intercept (ancestral value) should therefore be penalized by log(1 + 5.54) in the simple BM model. In the BM model that includes a shift along the parasitic plant lineage, the intercept should be penalized by log(1 + 3.98) and the shift by log(1 + 2.72).
3. Convergence of estimators.
This section proves the convergence of β̂ = β̂^{(n)} as the sample size n increases. The assumption of a fixed root implies that the covariance matrix V_tree = V_n (indexed by the sample size) is a submatrix of V_{n+1}.

Theorem 1.
Consider the linear model Y_i = X_i β + ε_i with ε^{(n)} = (ε_1, ..., ε_n)^t ∼ N(0, σ² V_n), and where the predictors X may be either fixed or random. Assume the design matrix X^{(n)} (with X_i as its i-th row) is of full rank provided n is large enough. Then the estimator β̂^{(n)} = (X^{(n)t} V_n^{-1} X^{(n)})^{-1} X^{(n)t} V_n^{-1} Y^{(n)} is convergent almost surely and in L². Component β̂^{(n)}_j converges to the true value β_j if and only if its asymptotic variance is zero. Otherwise, it converges to a random variable β̂*_j, which depends on the tree and the actual data.

Note that no assumption is made on the covariance structure V_n, except that it is a submatrix of V_{n+1}. Therefore, Theorem 1 holds regardless of how the sequence V_n is selected. For instance, it holds for the OU model, whose covariance matrix has components V_ij = e^{−α d_ij} or V_ij = (1 − e^{−α t_ij}) e^{−α d_ij} (depending on whether the ancestral state is conditioned upon or integrated out), where d_ij is the tree distance between tips i and j, and α is the known selection strength.

Theorem 1 can be viewed as a strong law of large numbers: in the absence of covariates and in the i.i.d. case, β̂^{(n)} is just the sample mean. Here, in the absence of covariates, β̂^{(n)} is a weighted average of the observed values, estimating the ancestral state at the root of the tree. Sampling units close to the root could be provided by fossil species or by early viral samples when sampling spans several years. Such units, close to the root, weigh more in β̂^{(n)} than units further away from the root. Theorem 1 gives a law of large numbers for this weighted average. However, we will see in Section 4 that the limit is random: β̂^{(n)} is inconsistent.

Proof of Theorem 1.
The process ε = (ε_1, ε_2, ...) is well defined on a probability space Ω because the covariance matrix V_n is a submatrix of V_{n+1}. Derivations below are made conditional on the predictors X. In a Bayesian-like approach, the probability space is expanded to Ω̃ = R^k × Ω by considering β ∈ R^k as a random variable, independent of the errors ε. Assume a priori that β is normally distributed with mean 0 and covariance matrix σ² I_k, I_k being the identity matrix of size k. Let F_n be the filtration generated by Y_1, ..., Y_n. Since β, Y_1, Y_2, ... is a Gaussian process, the conditional expectation E(β | F_n) is a linear combination of Y_1, ..., Y_n up to a constant:

E(β | F_n) = a_n + M_n Y^{(n)}.

The almost sure convergence of β̂^{(n)} will follow from the almost sure convergence of the martingale E(β | F_n) and from identifying M_n Y^{(n)} with a linear transformation of β̂^{(n)}. The vector a_n and matrix M_n are such that E(β | F_n) is the projection of β on F_n in L²(Ω̃), that is, these coefficients are such that

trace E((β − a_n − M_n Y^{(n)})(β − a_n − M_n Y^{(n)})^t)

is minimum. Since Y_i = X_i β + ε_i, β is centered and independent of ε, we get that a_n = 0 and the quantity to be minimized is

tr((I_k − M_n X^{(n)}) var(β) (I_k − M_n X^{(n)})^t) + tr(M_n var(ε^{(n)}) M_n^t).

The matrix M_n appears in the first term only through M_n X^{(n)}, so we can minimize σ² tr(M_n V_n M_n^t) under the constraint that B = M_n X^{(n)} is fixed. Using Lagrange multipliers, we get M_n V_n = Λ X^{(n)t} subject to M_n X^{(n)} = B. Assuming X^{(n)t} V_n^{-1} X^{(n)} is invertible, it follows that Λ = B (X^{(n)t} V_n^{-1} X^{(n)})^{-1} and M_n Y^{(n)} = B β̂^{(n)}. The minimum attained is then σ² tr(B (X^{(n)t} V_n^{-1} X^{(n)})^{-1} B^t). The minimum at step n + 1 is necessarily smaller than σ² tr(M V_{n+1} M^t) when M is formed by M_n and an additional column of zeros, which equals the minimum at step n. So for any B, the trace of B (X^{(n)t} V_n^{-1} X^{(n)})^{-1} B^t is a decreasing sequence. Since it is also nonnegative, it is convergent, and so is (X^{(n)t} V_n^{-1} X^{(n)})^{-1}. Now the quadratic expression

tr((I_k − B)(I_k − B)^t) + tr(B (X^{(n)t} V_n^{-1} X^{(n)})^{-1} B^t)

is minimized if B satisfies B (I_k + (X^{(n)t} V_n^{-1} X^{(n)})^{-1}) = I_k. Note that the symmetric positive definite matrix I_k + (X^{(n)t} V_n^{-1} X^{(n)})^{-1} was shown above to be decreasing with n. In summary,

E(β | F_n) = (I_k + (X^{(n)t} V_n^{-1} X^{(n)})^{-1})^{-1} β̂^{(n)}.

This martingale is bounded in L²(Ω̃), so it converges almost surely and in L²(Ω̃) to E(β | F_∞). Finally,

β̂^{(n)} − β = (I_k + (X^{(n)t} V_n^{-1} X^{(n)})^{-1}) E(β | F_n) − β

is also convergent almost surely and in L²(Ω̃). But β̂^{(n)} − β is a function of ω in the original probability space Ω, independent of β. Therefore, for any β, β̂^{(n)} converges almost surely and in L²(Ω). Since ε is a Gaussian process, the limit of β̂^{(n)} is normally distributed, with covariance matrix the limit of (X^{(n)t} V_n^{-1} X^{(n)})^{-1}. It follows that β̂^{(n)}_k, which is unbiased, converges to the true β_k if and only if the k-th diagonal element of (X^{(n)t} V_n^{-1} X^{(n)})^{-1} goes to 0. □
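The shrinkage identity at the heart of the proof can be spot-checked numerically: with the prior β ∼ N(0, σ² I_k), the Gaussian posterior mean (X^t V^{-1} X + I)^{-1} X^t V^{-1} Y coincides with (I + (X^t V^{-1} X)^{-1})^{-1} β̂. The 4-tip covariance and design below are hypothetical.

```python
import numpy as np

# Check E(beta | F_n) = (I + (X^t V^-1 X)^-1)^-1 beta_hat on a small
# hypothetical example: 4 tips, an intercept and one random covariate.
rng = np.random.default_rng(1)
V = np.array([
    [1.0, 0.7, 0.0, 0.0],
    [0.7, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.4],
    [0.0, 0.0, 0.4, 1.0],
])
X = np.column_stack([np.ones(4), rng.standard_normal(4)])
Y = rng.standard_normal(4)

Vinv = np.linalg.inv(V)
A = X.T @ Vinv @ X                            # X^t V^-1 X
beta_hat = np.linalg.solve(A, X.T @ Vinv @ Y) # GLS estimator

# Posterior mean under prior beta ~ N(0, sigma^2 I): (A + I)^-1 X^t V^-1 Y
# (the common factor sigma^2 in prior and noise cancels).
post_mean = np.linalg.solve(A + np.eye(2), X.T @ Vinv @ Y)
shrunk = np.linalg.solve(np.eye(2) + np.linalg.inv(A), beta_hat)
assert np.allclose(post_mean, shrunk)
```

The algebra behind the check is (I + A^{-1})^{-1} A^{-1} = (A(I + A^{-1}))^{-1} = (A + I)^{-1}, with A = X^t V^{-1} X.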
4. Consistency of estimators.
In this section I prove bounds on the variance of various effects β̂_i. From Theorem 1, we know that β̂_i is strongly consistent if and only if its variance goes to zero.

4.1. Intercept.
Assume here that the first column of X is the column of ones, and the first component of β, the intercept, is denoted β_0.

Proposition 2.
Let k be the number of daughters of the root node, and let t₁ be the length of the shortest branch stemming from the root. Then var(β̂_0) ≥ σ² t₁/k. In particular, when the tree is binary we have var(β̂_0) ≥ σ² t₁/2.

The following inconsistency result follows directly.
Corollary 3.
If there is a lower bound t₁ > 0 on the length of branches stemming from the root, and an upper bound on the number of branches stemming from the root, then β̂_0 is not a consistent estimator of the intercept, even though it is unbiased and convergent.

The conditions above are very natural in most biological settings, since most ancient lineages have gone extinct. The lower bound may be pushed down if abundant fossil data is available or if there has been adaptive radiation with a burst of speciation events at the root of the tree.
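Corollary 3 can be illustrated numerically on a hypothetical family of trees: fix k = 2 branches of length t₁ at the root and attach ever larger star subtrees below them. With no covariates, the GLS variance of the intercept is σ²/(1^t V^{-1} 1) (a standard formula), and it never drops below the σ² t₁/k bound however many tips are added.

```python
import numpy as np

# Hypothetical trees: root has k = 2 child edges of fixed length t1 = 0.5,
# each ending in a star subtree with c tips on edges of length 1 - t1.
t1, sigma2 = 0.5, 1.0

def var_intercept(c):
    # Covariance of one star subtree's tips: t1 shared, total height 1.
    block = t1 * np.ones((c, c)) + (1 - t1) * np.eye(c)
    V = np.block([[block, np.zeros((c, c))],
                  [np.zeros((c, c)), block]])
    ones = np.ones(2 * c)
    return sigma2 / (ones @ np.linalg.solve(V, ones))

variances = [var_intercept(c) for c in (1, 10, 100, 1000)]
lower_bound = sigma2 * t1 / 2          # Proposition 2 with k = 2
assert all(v >= lower_bound - 1e-12 for v in variances)
assert variances[-1] < variances[0]    # more tips do reduce the variance...
assert variances[-1] > 0.99 * lower_bound  # ...but only down to the bound

# One way to define the effective sample size (an assumption consistent
# with the bound n_e <= k T / t1 quoted in Section 2): n_e = T * 1^t V^-1 1.
n_e = 1.0 / variances[-1]              # here T = 1
assert n_e <= 2 * 1.0 / t1             # k T / t1 = 4
```

Here var(β̂_0) equals (1 + c)/(4c) in closed form, so it plateaus at 0.25 = σ² t₁/2 rather than decaying like 1/n.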
Proof of Proposition 2.
Assuming the linear model is correct, the variance of β̂ is given by var(β̂) = σ² (X^t V^{-1} X)^{-1}, where the first column of X is the vector of ones, so that the variance of the intercept estimator is just the first diagonal element of σ² (X^t V^{-1} X)^{-1}. But ((X^t V^{-1} X)^{-1})_ii ≥ (X_i^t V^{-1} X_i)^{-1} for any index i [Rao (1973), 5a.3, page 327], so the proof can be reduced to the simplest case with no covariates: Y_i = β_0 + ε_i. The basic idea is that the information provided by all the tips on the ancestral state β_0 is no more than the information provided just by the k direct descendants of the root. Let us consider Z_1, ..., Z_k to be the character states at the k branches stemming from the root after a time t₁ of evolution (Figure 4, left). These states are not observed, but the observed values Y_1, ..., Y_n have evolved from Z_1, ..., Z_k. Now I claim that the variance of β̂_0 obtained from the Y values is no smaller than the variance of β̂_0^{(z)} obtained from the Z values. Since the Z values are i.i.d. Gaussian with mean β_0 and variance σ² t₁, β̂_0^{(z)} has variance σ² t₁/k. To prove the claim, consider β_0 ∼ N(0, σ²) independent of ε. Then E(β_0 | Y_1, ..., Y_n, Z_1, ..., Z_k) = E(β_0 | Z_1, ..., Z_k), so that var(E(β_0 | Y_1, ..., Y_n)) ≤ var(E(β_0 | Z_1, ..., Z_k)). The proof of Theorem 1 shows that E(β_0 | Y_1, ..., Y_n) = β̂_0/(1 + t_y), where t_y = (1^t V^{-1} 1)^{-1}; similarly, E(β_0 | Z_1, ..., Z_k) = β̂_0^{(z)}/(1 + t_z), where t_z = t₁/k. Since β_0 and β̂_0 − β_0 are independent, the variance of E(β_0 | Y_1, ..., Y_n) is (σ² + t_y σ²)/(1 + t_y)² = σ²/(1 + t_y). The variance of E(β_0 | Z_1, ..., Z_k) is obtained similarly, and we get 1/(1 + t_y) ≤ 1/(1 + t_z), that is, t_y ≥ t_z and var(β̂_0) = σ² (1^t V^{-1} 1)^{-1} ≥ σ² t₁/k. □

4.2. Lineage effect.
This section considers a predictor X that defines a subtree, that is, X_i = 1 if tip i belongs to the subtree and 0 otherwise. This is similar to a 2-sample comparison problem. The typical “treatment” effect corresponds here to a “lineage” effect, the lineage being the branch subtending the subtree of interest. If a shift occurred along that lineage,

Fig. 4.
Left: Observed states are Y_1, ..., Y_n, while Z_1, ..., Z_k are the unobserved states along the k edges branching from the root, after time t₁ of evolution. Y provides less information on β_0 than Z. Right: Z_0, Z_1, ..., Z_{k_top} are unobserved states providing more information on the lineage effect β_1 than the observed Y values.

Fig. 5.
Model M₁ (left) and M₂ (right) with a lineage effect. X₁ is the indicator of a subtree. Model M₂ conditions on the state at the subtree's root, modifying the dependence structure.

tips in the subtree will tend to have, say, high trait values relative to the other tips. However, the BM model does predict a change on any branch in the tree. So the question is whether the actual shift on the lineage of interest is compatible with a BM change, or whether it is too large to be solely explained by Brownian motion. Alternatively, one might just estimate the actual change.

This consideration leads to two models. In the first model, a shift β_1 = β_top^{(S)} is added to the Brownian motion change along the branch of interest, so that β_top^{(S)} represents the character displacement not due to BM noise. In the second model, β_1 = β_top^{(SB)} is the actual change, which is the sum of the Brownian motion noise and any extra shift. Observations are then conditioned on the actual ancestral states at the root and at the subtree's root (Figure 5). By the Markov property, observations from the two subtrees are conditionally independent of each other. In the second model then, the covariance matrix is modified. The models are written

Y = β_0 + X₁β_1 + · · · + X_kβ_k + ε

with β_1 = β_top^{(S)} and ε ∼ N(0, σ² V_tree) in the first model, while β_1 = β_top^{(SB)} and ε ∼ N(0, σ² diag(V_top, V_bot)) in the second model, where V_top and V_bot are the covariance matrices associated with the top and bottom subtrees obtained by removing the branch subtending the group of interest (Figure 5).
Let k_top be the number of branches stemming from the root of the subtree of interest, t_top the length of the shortest branch stemming from the root of this subtree, and t the length of the branch subtending the subtree. Then

var(β̂_top^{(S)}) ≥ σ² (t + t_top/k_top)  and  var(β̂_top^{(SB)}) ≥ σ² t_top/k_top.

Therefore, if t_top/k_top remains bounded from below when the sample size increases, both estimators β̂_top^{(S)} (pure shift) and β̂_top^{(SB)} (actual change) are inconsistent.
From a practical point of view, unless fossil data is available or there was a radiation (burst of speciation events) at both ends of the lineage, shift estimators are not consistent. Increasing the sample size might not help detect a shift as much as one would typically expect.

Note that the pure shift β_top^{(S)} is confounded with the Brownian noise, so it is no wonder that this quantity is not identifiable as soon as t > 0. The advantage of the first model is that the BM with no additional shift is nested within it.
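The second bound of Proposition 4 can be checked numerically on a small hypothetical configuration for model M₂: 2 tips inside the subtree of interest, 3 tips outside, with the covariance values below chosen arbitrarily for illustration.

```python
import numpy as np

# Hypothetical model-M2 setup: the subtree of interest ("top") has
# k_top = 2 tips on edges of length t_top from its root; the remaining
# ("bottom") tree has 3 tips with some arbitrary BM covariance.
sigma2, t_top, k_top = 1.0, 0.3, 2
V_top = t_top * np.eye(2)
V_bot = 0.2 * np.ones((3, 3)) + 0.8 * np.eye(3)
V = np.block([[V_top, np.zeros((2, 3))],
              [np.zeros((3, 2)), V_bot]])

# Design: intercept plus the subtree indicator X1 = (1, 1, 0, 0, 0).
X = np.column_stack([np.ones(5), np.array([1.0, 1, 0, 0, 0])])
cov_beta = sigma2 * np.linalg.inv(X.T @ np.linalg.solve(V, X))

# var(beta1_hat) is the second diagonal element; Proposition 4 bounds it.
assert cov_beta[1, 1] >= sigma2 * t_top / k_top
```

In this block-diagonal case var(β̂_1) works out to 1/(1^t V_top^{-1} 1) + 1/(1^t V_bot^{-1} 1), so the bound σ² t_top/k_top is the first term alone and is necessarily exceeded.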
Proof of Proposition 4.
In both models, var(β̂_1) is the second diagonal element of σ² (X^t V^{-1} X)^{-1}, which is bounded below by σ² (X₁^t V^{-1} X₁)^{-1}, so that we need just prove the result in the simplest model Y = β_1 X₁ + ε. Similarly to Proposition 2, define Z_1, ..., Z_{k_top} as the character states at the k_top direct descendants of the subtree's root after a time t_top of evolution. Also, let Z_0 be the state of the node just parent to the subtree's root (see Figure 4, right). As in Proposition 2, it is easy to see that the variance of β̂_1 given the Y is larger than the variance of β̂_1 given Z_0, Z_1, ..., Z_{k_top}. In the second model, β_1 = β_top^{(SB)} is the actual state at the subtree's root, so Z_1, ..., Z_{k_top} are i.i.d. Gaussian centered at β_top^{(SB)} with variance σ² t_top, and the result follows easily. In the first model, the state at the subtree's root is the sum of Z_0, β_top^{(S)} and the BM noise along the lineage, so β̂_top^{(S)} = (Z_1 + · · · + Z_{k_top})/k_top − Z_0. This estimate is the sum of β_top^{(S)}, the BM noise and the sampling error about the subtree's root. The result follows because the BM noise and sampling error are independent with variance σ² t and σ² t_top/k_top, respectively. □

4.3. Variance component.
In contrast to the intercept and lineage effects, inference on the rate σ² of variance accumulation is straightforward. An unbiased estimate of σ² is

σ̂² = RSS/(n − k) = (Ŷ − Y)^t V^{-1} (Ŷ − Y)/(n − k),

where Ŷ = X β̂ are the predicted values and n is the number of tips. The classical independence of σ̂² and β̂ still holds for any tree, and (n − k) σ̂²/σ² follows a χ²_{n−k} distribution, k being the rank of X. In particular, σ̂² is unbiased and converges to σ² almost surely as the sample size increases, as shown in Appendix B. Although not surprising, this behavior contrasts with the inconsistency of the intercept and lineage effect estimators. We keep in mind, however, that the convergence of σ̂² may not be robust to a violation of the normality assumption or to a misspecification of the dependence structure, either from an inadequate model (BM versus OU) or from an error in the tree.

4.4. Random covariate effects.
In this section X denotes the matrix of random covariates, excluding the vector of ones or any subtree indicator. In most cases it is reasonable to assume that random covariates also follow a Brownian motion on the tree. Covariates may be correlated, accumulating covariance tΣ on any single edge of length t. Then covariates j and k have covariance Σ_jk V_tree. With a slight abuse of notation (considering X as a single large vector), var(X) = Σ ⊗ V_tree.

Proposition 5.
Consider Y = β_0 + X β_1 + ε with ε ∼ N(0, σ² V_tree). Assume X follows a Brownian evolution on the tree with nondegenerate covariance Σ: X ∼ N(µ_X, Σ ⊗ V_tree). Then var(β̂_1) ∼ σ² Σ^{-1}/n asymptotically. In particular, β̂_1 estimates β_1 consistently by Theorem 1. Random covariate effects are consistently and efficiently estimated, even though the intercept is not.

Proof.
We may write V^{-1} = R^t R using the Cholesky decomposition, for example. Since R1 ≠ 0, we may find an orthogonal matrix O such that OR1 = (a, 0, ..., 0)^t for some a, so without loss of generality we may assume that R1 = (a, 0, ..., 0)^t. The model now becomes RY = R1 β_0 + RX β_1 + Rε, where the errors Rε are now i.i.d. Let X̃_1 be the first row of RX and let X̃ be the matrix made of all but the first row of RX. Similarly, let (ỹ_1, Ỹ^t)^t = RY and (ε̃_1, ε̃^t)^t = Rε. The model becomes Ỹ = X̃ β_1 + ε̃, ỹ_1 = a β_0 + X̃_1 β_1 + ε̃_1, with least squares solution β̂_1 = (X̃^t X̃)^{-1} X̃^t Ỹ = β_1 + (X̃^t X̃)^{-1} X̃^t ε̃ and β̂_0 = (ỹ_1 − X̃_1 β̂_1)/a. The variance of β̂_1 conditional on X is then σ² (X̃^t X̃)^{-1}. Using the condition on R1, the rows of X̃ are i.i.d. centered Gaussian with variance-covariance Σ, and (X̃^t X̃)^{-1} has an inverse Wishart distribution with n − 1 degrees of freedom. The variance of β̂_1 is then σ² E(X̃^t X̃)^{-1} = σ² Σ^{-1}/(n − k − 2), where k is the number of random covariates, which completes the proof. □

Remark.
The result still holds if one or more lineage effects are included and if the model conditions upon the character state at each subtree (second model in Section 4.2). The reason is that data from each subtree are independent, and in each subtree the model has just an intercept in addition to the random covariates.

The behavior of random effect estimators contrasts with the behavior of the intercept or lineage effect estimators. An intuitive explanation might be the following. Each cherry in the tree (pair of adjacent tips) is a pair of siblings. Each pair provides independent evidence on the change of Y and of X between the 2 siblings, even though parental information is unavailable. Even though the means of X and Y are poorly known, there is abundant evidence on how they change with each other. Similarly, the method of independent contrasts [Felsenstein (1985)] identifies n − 1 independent contrasts.
5. Phase transition for symmetric trees.
The motivation for this section is to determine the behavior of the intercept estimator when branches can be sampled closer and closer to the root. I show that the intercept can be consistently estimated, although the rate of convergence can be much slower than root $n$. The focus is on a special case with symmetric sampling (Figure 6). The tree has $m$ levels of internal nodes, with the root at level 1. All nodes at level $i$ share the same distance $t_1 + \cdots + t_{i-1}$ from the root and the same number of descendants $d_i$. In a binary tree, all internal nodes have 2 descendants and the sample size is $n = 2^m$. The total tree height is set to 1, that is, $t_1 + \cdots + t_m = 1$.

With these symmetries, the eigenvalues of the covariance matrix $V_n$ can be completely determined (see Appendix C), smaller eigenvalues being associated with shallower internal nodes (close to the tips) and larger eigenvalues being associated with more basal nodes (close to the root). In particular, the constant vector is an eigenvector and $(\mathbf{1}^t V_n^{-1}\mathbf{1})^{-1} = t_1/d_1 + \cdots + t_m/(d_1 \cdots d_m)$.

In order to sample branches close to the root, consider replicating the major branches stemming from the root. Specifically, a proportion $q$ of each of these $d_1$ branches is kept adjacent to the root, and the other proportion $1-q$ is replicated along with its subtending tree (Figure 6), that is, $t_1^{(m)} = q^{m-1}$ and $t_i^{(m)} = (1-q)q^{m-i}$ for $i = 2, \dots, m$. For simplicity, assume further that groups are replicated so that $d_1 = \cdots = d_m = d$.

Fig. 6.
Symmetric sampling (left) and replication of major branches close to the root (right).
The result below shows a law of large numbers and provides the rate of convergence.
Proposition 6.
Consider the model with an intercept and random covariates $Y = \mathbf{1}\beta_0 + X\beta_1 + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2 V_n)$, on the symmetric tree described above. Then $\hat\beta_0$ is consistent. The rate of convergence experiences a phase transition depending on how close to the root new branches are added: $\operatorname{var}(\hat\beta_0)$ is asymptotically proportional to $n^{-1}$ if $q < 1/d$, to $\ln(n)\,n^{-1}$ if $q = 1/d$, and to $n^{-\alpha}$ if $q > 1/d$, where $\alpha = -\ln(q)/\ln(d)$. Therefore, the root-$n$ rate of convergence is obtained as in the i.i.d. case if $q < 1/d$. Convergence is much slower if $q > 1/d$.

Proof.
By Theorem 1, the consistency of $\hat\beta_0$ follows from its variance going to 0. First consider the model with no covariates. Up to $\sigma^2$, the variance of $\hat\beta_0$ is $(\mathbf{1}^t V_n^{-1}\mathbf{1})^{-1} = t_1/d_1 + \cdots + t_m/(d_1 \cdots d_m)$, which is $q^{m-1}/d + (1-q)\,q^{m-1}\bigl(1-(qd)^{-(m-1)}\bigr)/\bigl(d(qd-1)\bigr)$ if $qd \neq 1$, and $(1 + (1-q)(m-1))/d^m$ if $qd = 1$. The result follows easily since $n = d^m$, $m \propto \ln(n)$ and $q^m = n^{-\alpha}$. In the presence of random covariates, it is easy to see that the variance of $\hat\beta_0$ is increased by $\operatorname{var}(\hat\mu_X(\hat\beta_1 - \beta_1))$, where $\hat\mu_X = \tilde x_1/a$ is the row vector of the covariates' estimated ancestral states (using notations from the proof of Proposition 5). By Proposition 5 this increase is $O(n^{-1})$, which completes the proof. □
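The closed form in the proof makes the phase transition easy to explore numerically. A small sketch (parameter values and names made up) of how $n\operatorname{var}(\hat\beta_0)$ behaves on either side of $q = 1/d$:

```python
import numpy as np

def var_intercept(m, d, q):
    # (1^t V_n^{-1} 1)^{-1} = sum_i t_i / d^i on the replicated symmetric tree,
    # with t_1 = q^(m-1), t_i = (1-q) q^(m-i) for i >= 2, and d_1 = ... = d_m = d.
    t = [q ** (m - 1)] + [(1 - q) * q ** (m - i) for i in range(2, m + 1)]
    return sum(t[i - 1] / d ** i for i in range(1, m + 1))

d = 2
for q in (0.25, 0.5, 0.8):          # q < 1/d, q = 1/d, q > 1/d
    scaled = [d ** m * var_intercept(m, d, q) for m in (5, 10, 15)]
    print(q, scaled)                # bounded, slowly growing, exploding, respectively
```

With $q < 1/d$ the scaled variance stabilizes (root-$n$ rate); with $q > 1/d$ it diverges like $n^{1-\alpha}$.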
6. Bayesian information criterion.
The basis for using BIC in model selection is that it provides a good approximation to the marginal probability of the data under a model, given a prior distribution on the parameters, when the sample size is large. The proof of this property uses the i.i.d. assumption quite heavily, and is based on the likelihood being more and more peaked around its maximum value. Here, however, the likelihood does not concentrate around its maximum value, since even an infinite sample size may contain little information about some parameters in the model. The following proposition shows that the penalty associated with the intercept or with a lineage effect ought to be bounded, and thus smaller than $\log(n)$.

Proposition 7.
Consider $k$ random covariates $X$ with Brownian evolution on the tree and nonsingular covariance $\Sigma$, and the linear models
$$Y = \mathbf{1}\beta_0 + X\beta_1 + \varepsilon \quad\text{with}\quad \varepsilon \sim \mathcal{N}(0, \sigma^2 V_{tree}), \tag{$M_0$}$$
$$Y = \mathbf{1}\beta_0 + X\beta_1 + \beta_{top}\mathbf{1}_{top} + \varepsilon \quad\text{with}\quad \varepsilon \sim \mathcal{N}(0, \sigma^2 V_{tree}), \tag{$M_1$}$$
where the lineage factor $\mathbf{1}_{top}$ is the indicator of a (top) subtree. Assume a smooth prior distribution $\pi$ over the parameters $\theta = (\beta, \sigma^2)$ and a sampling such that $\mathbf{1}^t V_n^{-1}\mathbf{1}$ is bounded, that is, branches are not sampled too close to the root. With model $M_1$, assume further that branches are not sampled too close to the lineage of interest, that is, $\mathbf{1}_{top}^t V_n^{-1}\mathbf{1}_{top}$ is bounded. Then for both models the marginal probability of the data $P(Y) = \int P(Y|\theta)\pi(\theta)\,d\theta$ satisfies
$$-2\log P(Y) = -2\log L(\hat\theta) + (k+1)\ln(n) + O(1)$$
as the sample size increases. Therefore, the penalty for the intercept and for a lineage effect is bounded as the sample size increases.

The poorly estimated parameters are not penalized as severely as the consistently estimated parameters, since they lead to only small or moderate increases in likelihood. Also, the prior information continues to influence the posterior even with a very large sample size. Note that the lineage effect $\beta_{top}$ may be either the pure shift or the actual change. Model $M_0$ is nested within $M_1$ in the first case only.

In the proof of Proposition 7 (see Appendix D) the $O(1)$ term is shown to be dominated by
$$C = \log\det\hat\Sigma - (k+1)\log(2\pi\hat\sigma^2) + \log 2 + D,$$
where $D$ depends on the model. In $M_0$,
$$D = -2\log \int_{\beta_0} \exp\bigl(-(\beta_0-\hat\beta_0)^2/(2t_0\hat\sigma^2)\bigr)\,\pi(\beta_0, \hat\beta_1, \hat\sigma^2)\,d\beta_0, \tag{1}$$
where $t_0 = \lim(\mathbf{1}^t V_n^{-1}\mathbf{1})^{-1}$. In $M_1$,
$$D = -2\log \int_{\beta_0,\beta_{top}} \exp\bigl(-\tilde\beta^t W^{-1}\tilde\beta/(2\hat\sigma^2)\bigr)\,\pi(\beta_0, \beta_{top}, \hat\beta_1, \hat\sigma^2)\,d\beta_0\,d\beta_{top}, \tag{2}$$
where $\tilde\beta^t = (\beta_0 - \hat\beta_0, \beta_{top} - \hat\beta_{top})$ and the $2\times 2$ matrix $W^{-1}$ has diagonal elements $\lim \mathbf{1}^t V_n^{-1}\mathbf{1} = t_0^{-1}$ and $\lim \mathbf{1}_{top}^t V_n^{-1}\mathbf{1}_{top} < \infty$, and off-diagonal element $\lim \mathbf{1}^t V_n^{-1}\mathbf{1}_{top}$, which does exist.

In the rest of the section I assume that all tips are at the same distance $T$ from the root.
This condition is realized when branch lengths are chronological times and tips are sampled simultaneously. Under BM, $Y_1, \dots, Y_n$ have common variance $\sigma^2 T$. The ancestral state at the root is estimated with asymptotic variance $\sigma^2/\lim_n \mathbf{1}^t V_n^{-1}\mathbf{1}$, while the same precision would be obtained from a sample of $n_e$ independent variables, where $n_e = T \lim_n \mathbf{1}^t V_n^{-1}\mathbf{1}$. Therefore, I call this quantity the effective sample size associated with the intercept.

The next proposition provides more accuracy for the penalty term in case the prior has a specific, reasonable form. In some settings, it has been shown
that the error term in the BIC approximation is actually better than $O(1)$. Kass and Wasserman (1995) show this error term is only $O(n^{-1/2})$ if the prior carries the same amount of information as a single observation would, as well as in the context of comparing nested models with an alternative hypothesis close to the null. I follow Kass and Wasserman (1996) and consider a "reference prior" that contains little information, like a single observation would [see also Raftery (1995, 1996), Wasserman (2000)]. In an empirical Bayes way, assume the prior is Gaussian centered at $\hat\theta$. Let $(\beta_1, \sigma^2)$ have prior variance $J_n^{-1} = \operatorname{diag}(\hat\sigma^2\hat\Sigma^{-1}, 2\hat\sigma^4)$ and be independent of the other parameter(s) $\beta_0$ and $\beta_{top}$. Also, let $\beta_0$ have variance $\hat\sigma^2 T$ in model $M_0$.

In model $M_1$, assume further that the tree is rooted at the base of the lineage of interest, so that the intercept is the ancestral state at the base of that lineage. This reparametrization has the advantage that $\hat\beta_0$ and $\hat\beta_0 + \hat\beta_{top}$ are asymptotically uncorrelated. A single observation from outside the subtree of interest (i.e., from the bottom subtree) would be centered at $\beta_0$ with variance $\sigma^2 T$, while a single observation from the top subtree would be centered at $\beta_0 + \beta_{top}$ with variance $\sigma^2 T_{top}$. In case $\beta_{top}$ is the pure shift, then $T_{top} = T$. If $\beta_{top}$ is the actual change along the lineage, then $T_{top}$ is the height of the subtree excluding its subtending branch. Therefore, it is reasonable to assign $(\beta_0, \beta_{top})$ a prior variance of $\hat\sigma^2 W_\pi$ with
$$W_\pi = \begin{pmatrix} T & -T \\ -T & T + T_{top} \end{pmatrix}.$$
The only tips informing $\beta_0 + \beta_{top}$ are those in the top subtree, and the only units informing $\beta_0$ are those in the bottom subtree. Therefore, the effective sample sizes associated with the intercept and lineage effects are defined as
$$n_{e,bot} = T \lim \mathbf{1}^t V_{bot}^{-1}\mathbf{1}, \qquad n_{e,top} = T_{top} \lim \mathbf{1}^t V_{top}^{-1}\mathbf{1},$$
where $V_{bot}$ and $V_{top}$ are the variance matrices from the bottom and top subtrees.

Proposition 8.
Consider models $M_0$ and $M_1$ and the prior specified above. Then
$$-2\log P(Y|M_0) = -2\log L(\hat\theta|M_0) + (k+1)\ln(n) + \ln(1+n_e) + o(1)$$
and
$$-2\log P(Y|M_1) = -2\log L(\hat\theta|M_1) + (k+1)\ln(n) + \ln(1+n_{e,bot}) + \ln(1+n_{e,top}) + o(1).$$
Therefore, a reasonable penalty for the nonconsistently estimated parameters is the log of their effective sample sizes plus one.

Proof.
With model $M_0$, we get from (1)
$$D = -2\log\pi(\hat\beta_1,\hat\sigma^2) - 2\log\int \exp\Bigl(-\frac{(\beta_0-\hat\beta_0)^2}{2t_0\hat\sigma^2} - \frac{(\beta_0-\hat\beta_0)^2}{2T\hat\sigma^2}\Bigr)\,\frac{d\beta_0}{\sqrt{2\pi T\hat\sigma^2}} = -2\log\pi(\hat\beta_1,\hat\sigma^2) + \log(1 + T/t_0).$$
Now $-2\log\pi(\hat\beta_1,\hat\sigma^2) = (k+1)\log(2\pi) - \log\det J_n$ cancels with the first constant terms to give $C = \log(1+T/t_0) = \log(1+n_e)$. With model $M_1$, we get $D = -2\log\pi(\hat\beta_1,\hat\sigma^2) - 2\log\bigl(\det(W^{-1}+W_\pi^{-1})^{-1/2}\det W_\pi^{-1/2}\bigr)$, so that again $C = \log\bigl(\det(W^{-1}+W_\pi^{-1})\det W_\pi\bigr)$. It remains to identify this quantity with $\ln(1+n_{e,bot}) + \ln(1+n_{e,top})$. It is easy to see that $\det W_\pi = T\,T_{top}$ and
$$W_\pi^{-1} = \begin{pmatrix} T^{-1} + T_{top}^{-1} & T_{top}^{-1} \\ T_{top}^{-1} & T_{top}^{-1} \end{pmatrix}.$$
Since $V$ is block diagonal $\operatorname{diag}(V_{top}, V_{bot})$, we have that $\mathbf{1}_{top}^t V^{-1}\mathbf{1}_{top} = n_{e,top}/T_{top}$ and $\mathbf{1}^t V^{-1}\mathbf{1} = \mathbf{1}^t V_{top}^{-1}\mathbf{1} + \mathbf{1}^t V_{bot}^{-1}\mathbf{1} = n_{e,top}/T_{top} + n_{e,bot}/T$. Therefore, $W^{-1}$ has diagonal terms $n_{e,top}/T_{top} + n_{e,bot}/T$ and $n_{e,top}/T_{top}$, and off-diagonal term $n_{e,top}/T_{top}$. We get $\det(W^{-1}+W_\pi^{-1}) = \bigl((n_{e,top}+1)/T_{top}\bigr)\bigl((n_{e,bot}+1)/T\bigr)$, which completes the proof. □

Akaike's information criterion (AIC).
This criterion [Akaike (1974)] is also widely used for model selection. With i.i.d. samples, AIC is an estimate of the Kullback–Leibler divergence between the true distribution of the data and the estimated distribution, up to a constant [Burnham and Anderson (2002)]. Because of the BM assumption, the Kullback–Leibler divergence can be calculated explicitly. Using the Gaussian distribution of the data, the mutual independence of $\hat\sigma^2$ and $\hat\beta$, and the chi-square distribution of $\hat\sigma^2$, the usual derivation of AIC applies. Contrary to BIC, the AIC approximation still holds with tree-structured dependence.
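The determinant identity that closes the proof of Proposition 8 above, $\det(W^{-1}+W_\pi^{-1})\det W_\pi = (1+n_{e,bot})(1+n_{e,top})$, is easy to check numerically. A sketch with made-up values (all names and numbers below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical tree heights and effective sample sizes
T, T_top = 1.0, 0.6
ne_bot, ne_top = 7.3, 2.1

# W^{-1} as identified in the proof: diagonal (ne_top/T_top + ne_bot/T, ne_top/T_top),
# off-diagonal ne_top/T_top
a, b = ne_top / T_top, ne_bot / T
W_inv = np.array([[a + b, a], [a, a]])

# Prior variance matrix W_pi and its inverse
W_pi = np.array([[T, -T], [-T, T + T_top]])
W_pi_inv = np.linalg.inv(W_pi)

lhs = np.linalg.det(W_inv + W_pi_inv) * np.linalg.det(W_pi)
rhs = (1 + ne_bot) * (1 + ne_top)
```

The identity holds for any positive values of $T$, $T_{top}$ and the effective sample sizes.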
7. Applications and discussion.
This paper provides a law of large numbers for non-i.i.d. sequences whose dependence is governed by a tree structure. Almost sure convergence is obtained, but the limit may or may not be the expected value. With spatial or temporal data, the correlation typically decreases rapidly with spatial distance or with time (e.g., AR processes) under expanding asymptotics. With a tree structure, any 2 new observations from 2 given subtrees will have the same correlation with each other as "older" observations from those subtrees. In spatial statistics, infill asymptotics also harbor a strong, nonvanishing correlation. This dependence implies a bounded effective sample size $n_e$ in most realistic biological settings. However, I showed that this effective sample size pertains to location parameters only (intercept, lineage effects). Inconsistency has also been described in population genetics. In particular, Tajima's estimator of the level of sequence diversity from a sample of $n$ individuals is not consistent [Tajima (1983)], while asymptotically optimal estimators only converge at rate $\log(n)$ rather than $n$ [Fu and Li (1993)]. The reason is that the genealogical correlation among individuals in the population decreases the available information.

Sampling design.
Very large genealogies are now available, with hundreds or thousands of tips [Cardillo et al. (2005), Beck et al. (2006)]. It is not uncommon that physiological, morphological or other phenotypic data cannot be measured for all units in the group of interest. For the purpose of estimating an ancestral state, the sampling strategy suggested here maximizes the scaled effective sample size $\mathbf{1}^t V_n^{-1}\mathbf{1}$ over all subsamples of size $n_0$, where $n_0$ is an affordable number of units to subsample. This criterion is a function of the rooted tree topology and its branch lengths. It is very easy to calculate with one tree traversal using Felsenstein's algorithm [Felsenstein (1985)], without inverting $V_n$. It might be computationally too costly to assess all subsamples of size $n_0$, but one might heuristically search only among the most star-like subtrees. Backward and forward stepwise search strategies were implemented, either removing or adding tips one at a time.

Desperate situation?
This paper provides a theoretical explanation for the known difficulty of estimating ancestral states. In terms of detecting non-Brownian shifts, our results imply that the maximum power cannot reach 100%, even with infinite sampling. Instead, what mostly drives the power of shift detection is the effect size $\beta/(\sigma\sqrt{t})$, where $\beta$ is the shift size and $t$ is the length of the lineage experiencing the shift. The situation is desperate only in cases when the effect size is small. Increased sampling may not provide more power.

Beyond the Brownian motion model.
The convergence result applies to any dependence matrix. Bounds on the variance of estimates do not apply to the Ornstein–Uhlenbeck model, so it would be interesting to study the consistency of estimates in this model. Indeed, when selection is strong the OU process is attracted to the optimal value and "forgets" the initial value exponentially fast. Several studies have clearly indicated that some ancestral states and lineage-specific optimal values are not estimable [Butler and King (2004), Verdú and Gleiser (2006)], thus bearing on the question of how efficiently these parameters can be estimated. While the OU model is already being used, theoretical questions remain open.
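Returning to the sampling-design criterion above: the scaled effective sample size $\mathbf{1}^t V_n^{-1}\mathbf{1}$ can indeed be computed in a single postorder traversal, using the recursion $(\mathbf{1}^t V^{-1}\mathbf{1})^{-1} = t + (\mathbf{1}^t V_{-t}^{-1}\mathbf{1})^{-1}$ from Appendix A, with independent subtrees combining additively. A sketch (the tree encoding and function names are made up for illustration, not the paper's implementation):

```python
import numpy as np

def scaled_ess(subtree):
    """subtree = (edge_length, [child subtrees]); a leaf has an empty child list.
    Returns 1^t V^{-1} 1 for the subtree hanging from this edge."""
    t, children = subtree
    if not children:
        return 1.0 / t                          # single tip at distance t
    s = sum(scaled_ess(c) for c in children)    # independent subtrees add up
    return s / (1.0 + t * s)                    # lengthen by edge t: (1/s + t)^{-1}

def tree_ess(root_children):
    return sum(scaled_ess(c) for c in root_children)

# 4-tip balanced tree of height 1, cherries splitting at depth 0.5:
cherry = (0.5, [(0.5, []), (0.5, [])])
val = tree_ess([cherry, cherry])                # 1^t V^{-1} 1, no inversion needed

# cross-check against the explicit covariance matrix
V = np.array([[1, .5, 0, 0], [.5, 1, 0, 0], [0, 0, 1, .5], [0, 0, .5, 1.]])
one = np.ones(4)
val_matrix = one @ np.linalg.solve(V, one)
```

The traversal costs $O(n)$ per candidate subsample, against $O(n^3)$ for a naive matrix inversion.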
Broader hierarchical autocorrelation context.
So far, linear models were considered in the context of biological data with shared ancestry. However, implications of this work are far reaching and may affect common practices
Fig. 7.
Trees associated with ANOVA models: 3 groups with fixed effects (left) or random effects (right). Variances within and among groups are $\sigma^2_e$ and $\sigma^2_a$, respectively.

in many fields, because tree-structured autocorrelation underlies many experimental designs. For instance, the typical ANOVA can be represented by a forest (with BM evolution), one star tree for each group (Figure 7). If groups have random effects, then a single tree captures this model (Figure 7). It shows visually how the variation decomposes into within- and among-group variation. ANOVA with several nested effects would be represented by a tree with more hierarchical levels, each node in the tree representing a group. In such random (or mixed) effect models, asymptotic results are known when the number of groups becomes large, while the number of units per group is not necessarily required to grow [Akritas and Arnold (2000), Wang and Akritas (2004), Güven (2006)]. The results presented here pertain to any kind of tree growth, even when group sizes are bounded.

Model selection.
Many aspects of the model can be selected for, such as the most important predictors or the appropriate dependence structure. Moreover, there often is some uncertainty in the tree structure or in the model of evolution. Several trees might be obtained from molecular data on several genes, for instance. These trees might have different topologies or just different sets of branch lengths. BIC values from several trees can be combined for model averaging. I showed in this paper that the standard form of BIC is inappropriate. Instead, I propose to adjust the penalty associated with an estimate using its effective sample size. AIC was shown to remain appropriate for approximating the Kullback–Leibler criterion.
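Returning to the ANOVA representation of Figure 7: the tree-based BM covariance reproduces the classical one-way random-effects covariance exactly, with BM "time" playing the role of variance. A small check (group sizes and variance values are made up):

```python
import numpy as np

# Random-effects one-way ANOVA as BM on a tree: the edge from the root to each
# group node has length sigma_a^2, and each pendant edge has length sigma_e^2.
sigma_a2, sigma_e2 = 2.0, 0.5
groups = [3, 2, 4]                       # hypothetical group sizes

n = sum(groups)
V_tree = np.zeros((n, n))
start = 0
for g in groups:
    sl = slice(start, start + g)
    V_tree[sl, sl] = sigma_a2            # shared path: root -> group node
    start += g
V_tree[np.diag_indices(n)] = sigma_a2 + sigma_e2   # tip depth = sigma_a^2 + sigma_e^2

# Classical random-effects covariance, built directly
labels = np.repeat(np.arange(len(groups)), groups)
V_anova = sigma_a2 * (labels[:, None] == labels[None, :]) + sigma_e2 * np.eye(n)
```

The two matrices agree entry by entry, which is the visual point of Figure 7.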
Open questions.
It was shown that the scaled effective sample size is bounded as long as the number $k$ of edges stemming from the root is bounded and their lengths are above some $t > 0$. The converse is not true in general. Take a star tree with edges of length $n$. Then $Y_1, \dots, Y_n \sim \mathcal{N}(\mu, \sigma^2 n)$ are independent, and $\mathbf{1}^t V_n^{-1}\mathbf{1} = \sum_{i=1}^n n^{-1} = 1$ is bounded. However, if one requires that the tree height is bounded (i.e., tips are distant from the root by no more than a maximum amount), then is it necessary to have $k < \infty$ and $t > 0$?

APPENDIX A: UPPER BOUND FOR THE EFFECTIVE SAMPLE SIZE

I prove here the claim made in Section 2 that the effective sample size for the intercept $n_e = T\,\mathbf{1}^t V^{-1}\mathbf{1}$ is bounded by $\mathrm{df}_P = L/T$, where $L$ is the tree length (the sum of all branch lengths), in case all tips are at equal distance $T$ from the root. It is easy to see that $V$ is block diagonal, each block corresponding to one subtree branching from the root. Therefore, $V^{-1}$ is also block diagonal and, by induction, we only need to show that $n_e \le L/T$ when the root is adjacent to a single edge. Let $t$ be the length of this edge. When this edge is pruned from the tree, one obtains a subtree of length $L-t$ whose tips are at distance $T-t$ from the root. Let $V_{-t}$ be the covariance matrix associated with this subtree. By induction, one may assume that $(T-t)\,\mathbf{1}^t V_{-t}^{-1}\mathbf{1} \le (L-t)/(T-t)$, that is, $(\mathbf{1}^t V_{-t}^{-1}\mathbf{1})^{-1} \ge (T-t)^2/(L-t)$. Now $V$ is of the form $tJ + V_{-t}$, where $J = \mathbf{1}\mathbf{1}^t$ is a square matrix of ones. It is easy to check that $\mathbf{1}^t V^{-1}\mathbf{1} = \mathbf{1}^t V_{-t}^{-1}\mathbf{1}/(1 + t\,\mathbf{1}^t V_{-t}^{-1}\mathbf{1})$, so that $\mathbf{1}^t V^{-1}\mathbf{1} = \bigl((\mathbf{1}^t V_{-t}^{-1}\mathbf{1})^{-1} + t\bigr)^{-1} \le \bigl((T-t)^2/(L-t) + t\bigr)^{-1}$. By concavity of the inverse function, $((1-\lambda)/a + \lambda/b)^{-1} < (1-\lambda)a + \lambda b$ for all $\lambda$ in $(0,1)$ and all $a > b > 0$. Combining the two previous inequalities with $\lambda = t/T$, $a = (L-t)/(T-t)$ and $b = 1$ yields $T\,\mathbf{1}^t V^{-1}\mathbf{1} < L/T$ and proves the claim. The equality $n_e = \mathrm{df}_P$ only occurs when the tree is reduced to a single tip, in which case $n_e = 1 = \mathrm{df}_P$.

APPENDIX B: ALMOST SURE CONVERGENCE OF $\hat\sigma^2$ AND $\hat\Sigma$

Convergence of $\hat\sigma^2$ in probability is obtained because $\nu\hat\sigma^2_n/\sigma^2$ has a chi-square distribution with degrees of freedom $\nu = n - r$, $r$ being the total number of covariates. The exact knowledge of this distribution provides bounds on tail probabilities. Strong convergence follows from the convergence of the series $\sum_n P(|\hat\sigma^2_n - \sigma^2| > \varepsilon) < \infty$ for all $\varepsilon > 0$, which in turn follows from the application of Chernov's bound and derivation of large deviations [Dembo and Zeitouni (1998)]: $P(\hat\sigma^2 - \sigma^2 > \varepsilon) \le \exp(-\nu I(\varepsilon))$ and $P(\hat\sigma^2 - \sigma^2 < -\varepsilon) \le \exp(-\nu I(-\varepsilon))$, where the rate function $I(\varepsilon) = (\varepsilon - \log(1+\varepsilon))/2 > 0$ for $\varepsilon \neq 0$. Similarly, $\nu\hat\Sigma_n = \tilde X^t\tilde X = (X - \mathbf{1}\hat\mu_X)^t V_n^{-1}(X - \mathbf{1}\hat\mu_X)$, with $\tilde X$ as in the proof of Proposition 5, has a Wishart distribution with degrees of freedom $\nu = n-1$ and covariance parameter $\Sigma$. For each vector $c$ then, $c^t \nu\hat\Sigma_n c$ is distributed as $c^t\Sigma c$ times a chi-square variable with $\nu$ degrees of freedom, so that $c^t\hat\Sigma_n c$ converges almost surely to $c^t\Sigma c$ by the above argument. Using the indicator of the $j$th coordinate $c = e_j$, then $c = e_i + e_j$, we obtain the strong convergence of $\hat\Sigma$ to $\Sigma$.
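The bound $n_e \le L/T$ of Appendix A and the eigenvalue formula derived next in Appendix C can both be checked on a small symmetric tree (the builder code below is illustrative, not the paper's):

```python
import numpy as np
from itertools import product

def symmetric_tree_cov(d, t):
    # Level-i nodes lie at distance t_1 + ... + t_{i-1} from the root and have
    # d_i children; tips are indexed by their path of child choices.
    tips = list(product(*[range(k) for k in d]))
    m, T = len(d), sum(t)
    V = np.empty((len(tips), len(tips)))
    for a, u in enumerate(tips):
        for b, v in enumerate(tips):
            if u == v:
                V[a, b] = T
            else:
                i = next(j for j in range(m) if u[j] != v[j])  # level of divergence
                V[a, b] = sum(t[:i])
    return V

d, t = [2, 3], [0.5, 0.5]
V = symmetric_tree_cov(d, t)
n, T = V.shape[0], sum(t)

# Appendix C: lambda_i = n (t_i/(d_1...d_i) + ... + t_m/(d_1...d_m)),
# multiplicity d_1 for i = 1 and d_1...d_{i-1}(d_i - 1) for i >= 2.
expected = []
for i in range(1, len(d) + 1):
    lam = n * sum(t[j - 1] / np.prod(d[:j]) for j in range(i, len(d) + 1))
    mult = d[0] if i == 1 else int(np.prod(d[:i - 1])) * (d[i - 1] - 1)
    expected += [lam] * mult

observed = np.linalg.eigvalsh(V)

# Appendix A: n_e = T 1^t V^{-1} 1 stays below L/T, L = total branch length
one = np.ones(n)
n_e = T * (one @ np.linalg.solve(V, one))
L_len = sum(np.prod(d[:i]) * t[i - 1] for i in range(1, len(d) + 1))  # d_1...d_i edges of length t_i
```

On this 6-tip tree the spectrum is $\{2.0$ (twice), $0.5$ (four times)$\}$ and $n_e = 3 \le L/T = 4$.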
APPENDIX C: SYMMETRIC TREES

With the symmetric sampling from Section 5, the eigenvalues of $V_n$ are of the form
$$\lambda_i = n\Bigl(\frac{t_i}{d_1\cdots d_i} + \cdots + \frac{t_m}{d_1\cdots d_m}\Bigr)$$
with multiplicity $d_1$ if $i = 1$ and $d_1\cdots d_{i-1}(d_i - 1)$ if $i \ge 2$. Indeed, it is easy to exhibit the eigenvectors of each $\lambda_i$. Consider $\lambda_1$ for instance. The $d_1$ descendants of the root define $d_1$ groups of tips. If $v$ is a vector such that $v_j = v_k$ whenever tips $j$ and $k$ are in the same group, then it is easy to see that $V_n v = \lambda_1 v$. This shows that $\lambda_1$ is an eigenvalue with multiplicity $d_1$ (at least). Now consider an internal node at level $i$. Its descendants form $d_i$ groups of tips, which we name $G_1, \dots, G_{d_i}$. Let $v$ be a vector such that $v_j = 0$ if tip $j$ is not a descendant of the node and $v_j = a_g$ if $j$ is a descendant from group $g$. Then, if $a_1 + \cdots + a_{d_i} = 0$, it is easy to see that $V_n v = \lambda_i v$. Since the multiplicities sum to $n$, all eigenvalues and eigenvectors have been identified.

APPENDIX D: BIC APPROXIMATION

Proof of Proposition 7.
The prior $\pi$ is assumed to be sufficiently smooth (four times continuously differentiable) and bounded. The same conditions are also required for $\pi_m$, defined by $\pi_m = \sup_{\beta_0}\pi(\beta_1, \sigma^2|\beta_0)$ in model $M_0$ and $\pi_m = \sup_{\beta_0,\beta_{top}}\pi(\beta_1, \sigma^2|\beta_0, \beta_{top})$ in model $M_1$. The extra assumption on $\pi_m$ is pretty mild; it holds when parameters are independent a priori, for instance.

For model $M_0$ the likelihood can be written
$$-2\log L(\theta) = -2\log L(\hat\theta) + n\Bigl(\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2}\Bigr) + \bigl((\beta_1-\hat\beta_1)^t X^t V_n^{-1}X(\beta_1-\hat\beta_1) + \mathbf{1}^t V_n^{-1}\mathbf{1}\,(\beta_0-\hat\beta_0)^2 + 2(\beta_0-\hat\beta_0)\,\mathbf{1}^t V_n^{-1}X(\beta_1-\hat\beta_1)\bigr)/\sigma^2.$$
Rearranging terms, we get $-2\log L(\theta) = -2\log L(\hat\theta) + 2nh_n(\theta) + a_n(\beta_0-\hat\beta_0)^2/\hat\sigma^2$, where
$$a_n = \mathbf{1}^t V_n^{-1}\mathbf{1} - \mathbf{1}^t V_n^{-1}X(X^t V_n^{-1}X)^{-1}X^t V_n^{-1}\mathbf{1},$$
$$2h_n(\theta) = \Bigl(\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2}\Bigr) + (\beta_1-u)^t \frac{X^t V_n^{-1}X}{n\sigma^2}(\beta_1-u) + \frac{a_n}{n}(\beta_0-\hat\beta_0)^2\Bigl(\frac{1}{\sigma^2} - \frac{1}{\hat\sigma^2}\Bigr)$$
and $u = \hat\beta_1 - (\beta_0-\hat\beta_0)(X^t V_n^{-1}X)^{-1}X^t V_n^{-1}\mathbf{1}$. For any fixed value of $\beta_0$, consider $h_n$ as a function of $\beta_1$ and $\sigma^2$. Its minimum is attained at $u$ and $\tilde\sigma^2 = \hat\sigma^2 + a_n(\beta_0-\hat\beta_0)^2/n$. At this point the minimum value is $2h_n(u, \tilde\sigma^2) = \log(1 + a_n(\beta_0-\hat\beta_0)^2/(n\hat\sigma^2)) - a_n(\beta_0-\hat\beta_0)^2/(n\hat\sigma^2)$, and the second derivative of $h_n$ is $J_n = \operatorname{diag}(X^t V_n^{-1}X/(n\hat\sigma^2),\, 1/(2\hat\sigma^4))$. Note that $\hat\mu_X = \mathbf{1}^t V_n^{-1}X/(\mathbf{1}^t V_n^{-1}\mathbf{1})$ is the row vector of estimated ancestral states of $X$, so by Theorem 1 it is convergent. Note also that $X^t V_n^{-1}X = (n-1)\hat\Sigma + (\mathbf{1}^t V_n^{-1}\mathbf{1})\,\hat\mu_X^t\hat\mu_X$. Since $\mathbf{1}^t V_n^{-1}\mathbf{1}$ is assumed bounded, $X^t V_n^{-1}X = n\hat\Sigma + O(1)$ almost surely, and the error term depends on $X$ only, not on the parameters $\beta$ or $\sigma^2$. Consequently, $a_n = \mathbf{1}^t V_n^{-1}\mathbf{1} + O(n^{-1})$ is almost surely bounded and $\tilde\sigma^2 = \hat\sigma^2 + O(n^{-1})$. It follows that for any fixed $\beta_0$, $J_n$ converges almost surely to $\operatorname{diag}(\Sigma/\sigma^2, 1/(2\sigma^4))$. Therefore, its eigenvalues are almost surely bounded and bounded away from zero, and $h_n$ is Laplace-regular as defined in Kass, Tierney and Kadane (1990). Theorem 1 in Kass, Tierney and Kadane (1990) shows that
$$-2\log\int e^{-nh_n}\pi\,d\beta_1\,d\sigma^2 = 2nh_n(u, \tilde\sigma^2) + (k+1)\log n + \log\det\hat\Sigma_1 - (k+1)\log(2\pi\tilde\sigma^2) + \log 2 - 2\log\bigl(\pi(\hat\beta_1, \hat\sigma^2|\beta_0) + O(n^{-1})\bigr)$$
with $\hat\Sigma_1 = X^t V_n^{-1}X/n = \hat\Sigma + O(n^{-1})$. Integrating further over $\beta_0$, we get $-2\log P(Y) = -2\log L(\hat\theta) + (k+1)\log n + \log\det\hat\Sigma - (k+1)\log(2\pi\hat\sigma^2) + \log 2 + D_n$, where
$$D_n = -2\log\int \exp\Bigl(-\frac{n-k-1}{2}\log\Bigl(1 + \frac{a_n(\beta_0-\hat\beta_0)^2}{n\hat\sigma^2}\Bigr)\Bigr)\bigl(\pi(\hat\beta_1, \hat\sigma^2|\beta_0) + O(n^{-1})\bigr)\pi(\beta_0)\,d\beta_0.$$
Heuristically, we see that $a_n$ converges to $t_0^{-1} = \lim \mathbf{1}^t V_n^{-1}\mathbf{1}$, and for fixed $\beta_0$ the integrand is equivalent to $\exp(-(\beta_0-\hat\beta_0)^2/(2t_0\hat\sigma^2))$, so we would conclude that $D_n$ converges to $D = -2\log\int \exp(-(\beta_0-\hat\beta_0)^2/(2t_0\hat\sigma^2))\,\pi(\beta_0, \hat\beta_1, \hat\sigma^2)\,d\beta_0$ as given in (1) and, thus,
$$-2\log P(Y) = -2\log L(\hat\theta) + (k+1)\log n + \log\det\hat\Sigma - (k+1)\log(2\pi\hat\sigma^2) + \log 2 + D + o(1).$$
Formally, we need to check that the $O(n^{-1})$ term in $D_n$ has an $o(1)$ contribution after integration, and that the limit of the integral is the integral of the point-wise limit. The integrand in $D_n$ is the product of
$$f_n(\beta_0) = n^{(k+1)/2}\int \exp\bigl(\log L(\theta) - \log L(\hat\theta)\bigr)\,\pi(\beta_1, \sigma^2|\beta_0)\,d\beta_1\,d\sigma^2$$
and of a quantity that converges almost surely: $(2\det\hat\Sigma)^{1/2}(2\pi\hat\sigma^2)^{-(k+1)/2}$. Maximizing the likelihood and prior in $\beta_0$, we see that $f_n$ is uniformly bounded in $\beta_0$ by
$$n^{(k+1)/2}\int \exp\Bigl(-\frac n2\Bigl(\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2} + (\beta_1-\hat\beta_1)^t\hat\Sigma_2(\beta_1-\hat\beta_1)/\sigma^2\Bigr)\Bigr)\pi_m(\beta_1, \sigma^2)\,d\beta_1\,d\sigma^2,$$
where $\hat\Sigma_2 = (X^t V_n^{-1}X - \mathbf{1}^t V_n^{-1}\mathbf{1}\,\hat\mu_X^t\hat\mu_X)/n$ converges almost surely to $\Sigma$. Since $\pi_m$ is assumed smooth and bounded, we can apply Theorem 1 from Kass, Tierney and Kadane (1990) again, and $f_n(\beta_0)$ is bounded by $(2\det\hat\Sigma_2)^{-1/2}(2\pi\hat\sigma^2)^{(k+1)/2}\pi_m(\hat\beta_1, \hat\sigma^2)$, which is a convergent quantity. Therefore, $f_n$ is uniformly bounded and by dominated convergence the limit of $\int f_n\pi\,d\beta_0$ equals the integral of the point-wise limit, so that $D_n = D + o(1)$ as claimed in (1).

For model $M_1$ the proof is similar. The value $u$ is now
$$\hat\beta_1 - (X^t V_n^{-1}X)^{-1}\bigl((\beta_0-\hat\beta_0)X^t V_n^{-1}\mathbf{1} + (\beta_{top}-\hat\beta_{top})X^t V_n^{-1}\mathbf{1}_{top}\bigr).$$
The term $a_n(\beta_0-\hat\beta_0)^2$ is replaced by $\tilde\beta^t A_n\tilde\beta$, where $\tilde\beta^t = (\beta_0-\hat\beta_0, \beta_{top}-\hat\beta_{top})$ and $A_n$ is the $2\times 2$ matrix with diagonal elements $a_n$ and $\mathbf{1}_{top}^t V^{-1}\mathbf{1}_{top} - \mathbf{1}_{top}^t V^{-1}X(X^t V^{-1}X)^{-1}X^t V^{-1}\mathbf{1}_{top}$, and off-diagonal element $\mathbf{1}^t V^{-1}\mathbf{1}_{top} - \mathbf{1}^t V^{-1}X(X^t V^{-1}X)^{-1}X^t V^{-1}\mathbf{1}_{top}$.
Note that, as before, elements in $A_n$ are dominated by their first term, since $X^t V^{-1}X = n\Sigma + O(1)$ almost surely. I show below that $A_n$ converges to $W^{-1}$ as defined in (2), whose elements are the limits of $\mathbf{1}^t V_n^{-1}\mathbf{1}$, $\mathbf{1}_{top}^t V_n^{-1}\mathbf{1}_{top}$ and $\mathbf{1}^t V_n^{-1}\mathbf{1}_{top}$. The first quantity is $t_0^{-1}$, finite by assumption. The second quantity equals $\mathbf{1}^t\tilde V^{-1}\mathbf{1}$, where $\tilde V$ is obtained by pruning the tree of all tips not in the top subtree, so it converges, its limit being finite by assumption. The third quantity exists because $(\mathbf{1}^t V_n^{-1}\mathbf{1})^{-1}(\mathbf{1}^t V_n^{-1}\mathbf{1}_{top})$ is $\hat\mu_{top}$, the estimated state at the root for the character $\mathbf{1}_{top}$. Theorem 1 cannot be applied to show its convergence, because $\mathbf{1}_{top}$ is a nonrandom character, but convergence follows from the following facts: (a) $\hat\mu_{top}$ is the estimated state at the root from a tree where the top subtree is reduced to a single "top" leaf, whose subtending branch length decreases, when more tips are added to the top subtree, to a nonnegative limit. (b) On the reduced tree, $\hat\mu_{top}$ is the weight with which the top leaf contributes to ancestral state estimation. (c) This weight decreases as more tips are added outside the top subtree. □
The author is very grateful to Thomas Kurtz for insightful discussions on the almost sure convergence result.

REFERENCES
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control.
Akritas, M. and Arnold, S. (2000). Asymptotics for analysis of variance when the number of levels is large. J. Amer. Statist. Assoc.
Beck, R. M. D., Bininda-Emonds, O. R. P., Cardillo, M., Liu, F.-G. R. and Purvis, A. (2006). A higher-level MRP supertree of placental mammals. BMC Evol. Biol.
Bhattacharya, T., Daniels, M., Heckerman, D., Foley, B., Frahm, N., Kadie, C., Carlson, J., Yusim, K., McMahon, B., Gaschen, B., Mallal, S., Mullins, J., Nickle, D., Herbeck, J., Rousseau, C., Learn, G., Miura, T., Brander, C., Walker, B. D. and Korber, B. (2007). Founder effects in the assessment of HIV polymorphisms and HLA allele associations. Science.
Blomberg, S. P., Garland, T., Jr. and Ives, A. R. (2003). Testing for phylogenetic signal in comparative data: Behavioral traits are more labile. Evolution.
Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed. Springer, New York. MR1919620
Butler, M. A. and King, A. A. (2004). Phylogenetic comparative analysis: A modeling approach for adaptive evolution. The American Naturalist.
Cardillo, M., Mace, G. M., Jones, K. E., Bielby, J., Bininda-Emonds, O. R. P., Sechrest, W., Orme, C. D. L. and Purvis, A. (2005). Multiple causes of high extinction risk in large mammal species. Science.
Cunningham, C. W., Omland, K. E. and Oakley, T. H. (1998). Reconstructing ancestral character states: A critical reappraisal. Trends in Ecology and Evolution.
Davis, C. C., Latvis, M., Nickrent, D. L., Wurdack, K. J. and Baum, D. A. (2007). Floral gigantism in Rafflesiaceae. Science.
Dembo, A. and Zeitouni, O. (1998). Large Deviations Techniques and Applications, 2nd ed. Springer, New York. MR1619036
Dressler, R. L. (1993). Phylogeny and Classification of the Orchid Family. Dioscorides Press, USA.
Felsenstein, J. (1985). Phylogenies and the comparative method. The American Naturalist.
Felsenstein, J. (2004). Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
Fu, Y.-X. and Li, W.-H. (1993). Maximum likelihood estimation of population parameters. Genetics.
Garland, T., Jr., Bennett, A. F. and Rezende, E. L. (2005). Phylogenetic approaches in comparative physiology. J. Experimental Biology.
Garland, T., Jr., Dickerman, A. W., Janis, C. M. and Jones, J. A. (1993). Phylogenetic analysis of covariance by computer simulation. Systematic Biology.
Garland, T., Jr. and Ives, A. R. (2000). Using the past to predict the present: Confidence intervals for regression equations in phylogenetic comparative methods. The American Naturalist.
Gu, X. (2004). Statistical framework for phylogenomic analysis of gene family expression profiles. Genetics.
Guo, H., Weiss, R. E., Gu, X. and Suchard, M. A. (2007). Time squared: Repeated measures on phylogenies. Molecular Biology and Evolution.
Güven, B. (2006). The limiting distribution of the F-statistic from nonnormal universes. Statistics.
Hansen, T. F. (1997). Stabilizing selection and the comparative analysis of adaptation. Evolution.
Hansen, T. F. and Martins, E. P. (1996). Translating between microevolutionary process and macroevolutionary patterns: The correlation structure of interspecific data. Evolution.
Harvey, P. H. and Pagel, M. (1991). The Comparative Method in Evolutionary Biology. Oxford Univ. Press.
Housworth, E. A., Martins, E. P. and Lynch, M. (2004). The phylogenetic mixed model. The American Naturalist.
Huelsenbeck, J. P. and Bollback, J. (2001). Empirical and hierarchical Bayesian estimation of ancestral states. Systematic Biology.
Johnson, N. L. and Kotz, S. (1972). Distributions in Statistics: Continuous Multivariate Distributions. Wiley, New York. MR0418337
Jønsson, K. A. and Fjeldså, J. (2006). A phylogenetic supertree of oscine passerine birds (Aves: Passeri). Zoologica Scripta.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc.
Kass, R. E., Tierney, L. and Kadane, J. B. (1990). The validity of posterior expansions based on Laplace's method. In Bayesian and Likelihood Methods in Statistics and Econometrics.
Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Assoc.
Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. J. Amer. Statist. Assoc.
Mace, R. and Holden, C. J. (2005). A phylogenetic approach to cultural evolution. Trends in Ecology and Evolution.
Martins, E. P. (2000). Adaptation and the comparative method. Trends in Ecology and Evolution.
Martins, E. P. and Hansen, T. F. (1997). Phylogenies and the comparative method: A general approach to incorporating phylogenetic information into the analysis of interspecific data. The American Naturalist.
McArdle, B. and Rodrigo, A. G. (1994). Estimating the ancestral states of a continuous-valued character using squared-change parsimony: An analytical solution. Systematic Biology.
Pagel, M. (1999). The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies. Systematic Biology.
Pagel, M., Atkinson, Q. D. and Meade, A. (2007). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature.
Pagel, M., Meade, A. and Barker, D. (2004). Bayesian estimation of ancestral character states on phylogenies. Systematic Biology.
Paradis, E. and Claude, J. (2002). Analysis of comparative data using generalized estimating equations. J. Theoret. Biology.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology.
Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York. MR0346957
Rohlf, F. J. (2006). A comment on phylogenetic regression. Evolution.
Schluter, D., Price, T., Mooers, A. O. and Ludwig, D. (1997). Likelihood of ancestor states in adaptive radiation. Evolution.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist.
Semple, C. and Steel, M. (2003). Phylogenetics. Oxford Univ. Press, New York. MR2060009
Spooner, D. M. and Hijmans, R. J. (2001). Potato systematics and germplasm collecting, 1989–2000. American J. Potato Research.
Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics.
Verdú, M. and Gleiser, G. (2006). Adaptive evolution of reproductive and vegetative traits driven by breeding systems. New Phytologist.
Wang, H. and Akritas, M. (2004). Rank tests for ANOVA with large number of factor levels. J. Nonparametr. Stat.
Wasserman, L. (2000). Bayesian model selection and model averaging. J. Math. Psych.
Zhang, H. and Zimmerman, D. L. (2005). Towards reconciling two asymptotic frameworks in spatial statistics.