A new distance between DNA sequences
AA new distance between DNA sequences
Viswanath.C.Narayanan Govt. Engineering College, Thrissur, India Email: [email protected] Key words: Continuous time Markov Chain, Jump process, Generator matrix, Phylogenetic tree.
Abstract Motivation:
We propose a new distance metric for DNA sequences, which can be defined on any evolutionary Markov model with infinitesimal generator matrix Q. That is the new metric can be defined under existing models such as Jukes-Cantor model, Kimura-2-parameter model, F84 model, GTR model etc. Since our metric does not depend on the form of the generator matrix Q, it can be defined for very general models including those with varying nucleotide substitution rates among lineages. This makes our metric widely applicable. The simulation experiments carried out shows that the new metric, when defined under classical models such as the JC, F84 and Kimura-2-parameter models, performs better than these existing metrics in recovering phylogenetic trees from sequence data. Our simulation experiments also show that the new metric, under a model that allows varying nucleotide substitution rates among lineages, performs equally well or better than its other forms studied.
Distance metrics plays an important role in phylogenetic reconstruction. We have various distance metrics used in phylogenetic analysis which are based on different DNA substitution models such as: Jukes-Cantor [1], Kimura [2,3], Felsenstein [4,5], Hasegawa, Kishino and Yano [6,7], Tamura and Nei [9], Posada [10] and Tavare [8]. In each of these models the substitution process is a continuous-time Markov chain (see [12]) with states {A,C,G,T}, a 4 × π and a 4 × ℜ with state space {A,C,G,T} and infinitesimal generator Q. Now identifying the mutations in the evolutionary process with jumps in the Markov process ℜ , we can trace out all the mutations. The average number of mutations in the evolutionary process is the average number of jumps in ℜ . This number is calculated and is taken as the distance between the wo DNA sequences. Thus we have a new distance between two DNA sequences which gives a direct measure of the average number of mutations including the hidden ones; and which can be defined under a variety of evolutionary models including those which can handle the cases of varying nucleotide substitution rates among lineages. For testing the efficiency of the new metric, we defined our metric on three existing models namely the Jukes-Cantor model, the Kimura-2-parameter model, and the F84 model and used each of them in phylogenetic tree reconstruction from simulated as well as real data. The efficiency of the new metric in each case is compared with that of the corresponding existing metric in terms of the distance of the recovered tree from the true tree. In many cases, the simulation experiments showed a more efficient performance of our metric over the corresponding existing ones. We also tested the efficiency of our metric, defining it under a fourth model, which can afford varying nucleotide substitution rates across lineages. Our simulation results show that this fourth definition performs equally well with the other three definitions. The paper is arranged as follows. In section 2 the definition of the distance function and its explicit expression in some particular cases are given. Section 3 discusses the efficiency of the new metric in recovering phylogenetic trees from simulated as well as real data and section 4 concludes the discussion . Let F be the frequency matrix obtained by comparing sites in two DNA sequences X and Y and is given by, F =
FFFFT FFFFG FFFFC FFFFA TGCA
Χ Υ
For defining a distance between the DNA sequences X and Y, assume that the evolution process, which a cite in X and Y undergoes, is driven by a continuous time Markov chain Ψ with some 4 x 4 generator matrix Q. Identify ‘mutations ‘, which a cite undergoes during the evolution process, with ‘jumps’ in the process Ψ . Let the state space of the process Ψ be {1, 2, 3, 4} and )( tE ij denote the average number of jumps in Ψ which has reached the state j at time t, starting form state i. Then )( tE ij gives the average number of mutations in the interval [0,t) which a cite undergoes during the evolution process between X and Y. Now we can define the average number of mutations n the whole evolution process between X and Y in the time interval [0,t), which is taken as the distance between the sequences X and Y, to be; )( td Q = ∑∑ == + )(2 ji ijji ijjiij F tEFF . ……………(1) We see that by definition, )( td Q is symmetric; )( td Q = Ψ that is if and only if X=Y. Also the definition of )( td Q as the average number of mutations implies that it satisfies the triangle inequality. Hence we have the following theorem. Theorem 2.1.1
When t is fixed, )( td Q is a metric. )( tE ij Let N(s, s+t) be the number of jumps in the process Ψ during the interval (s, s+t]. For 1 ≤≤ ji , 4, n + Ζ∈ , define, the probabilities; ))(|),(,)(Pr(),( isntssNjtsntP ij =Ψ=+=+Ψ= ……..(2) Then )( tE ij can be obtained as: )( tE ij = ∑ ∞= ),( n ij ntnP ……..(3) The probabilities in equation (2) satisfies the following equations (see [17]): )exp()0,( tQtP iiijij δ= ……..(4) and for n ≥ , ∑ ∫ ≠= −=+
41 0 .)exp(),()1,( ill t iiljilij dyyQnytPQntP ……..(5)
Now, let E(t) be the matrix whose (i,j) th entry is )( tE ij , )(~ tE be the diagonal matrix whose i th diagonal entry is )exp( tQ ii , and J be the matrix whose (i,j) th entry is ij Q , for ji ≠ , and whose diagonal entries are zeros. Then equation (5) can be transformed into the matrix integral equation; ∫ ∫ −+−= t t dyytJPyEdyytJEyEtE .)()(~)()(~)( ………(6) Solving this equation gives the required )( tE ij s’. Example 2.1.1
If we take the generator matrix Q as −−−−−−−−−−−−= γβααβγ αγβαγβ βγγβαα γβαγβα Q , which is the rate matrix in the Kimura 3-parameter model, then )()( tEtE jiij = for all i,j and ∑ ∞= −++−= )!1())(exp()( n nnii ntHttE γβα , ∑ ∞= −++−== )!1())(exp()()( n nn ntHttEtE γβα , ∑ ∞= −++−== )!1(~))(exp()()( n nn ntHttEtE γβα , ∑ ∞= −++−== )!1(ˆ))(exp()()( n nn ntHttEtE γβα , with nnnn HHHH ˆ~ γβα ++= + ; 1 ≥ n , nnnn HHHH ˆ~ βγα ++= + ; 2 ≥ n , nnnn HHHH ˆ~ αγβ ++= + ; 2 ≥ n , nnnn HHHH ~ˆ αβγ ++= + ; 2 ≥ n . For the Jukes-Cantor model, which is the particular case of the above model with µγβα === , we have more explicit expressions of )( tE ij ’s as: + + + + + − = L !4461!3420!247!142143exp4)( µµµµµµ tttttttE ij , for jiji ≠≤≤ ,4,1 , and for 41 ≤≤ i , + + + + − = L !4460!3421!246!14343exp4)( µµµµµµ tttttttE ii . ,2ˆ,2~,2 ,ˆ,~,
222 111 αβαγγβ γβα === ===
HHH HHH . Software packages Mesquite [19] and Seq-Gen.v1.3.2 [20] are used for simulating sequence data on given random and real trees respectively. Real trees are taken from TreeFam Database [21,22]. MBEToolbox 2.0 [23] is used for getting the distance matrices using the new metric. For getting distance matrices using the JC, K2P, F84 distances, and for phylogenetic tree construction with the new as well as the existing metrics, PHYLIP 3.66 [24] is used. For comparing the phylogenetic trees recovered from the sequence data with the original tree, we used the package TOPD/FMTS [25].
3. Phylogenetic tree reconstruction based on )( td Q . To substantiate the claim that the new metric improves the existing metrics such as the Jukes-Cantor metric, F84 metric and Kimura-2-parameter metric, we conducted simulation experiments with simulated as well as real data. The details of these experiments are as follows.
Clearly, the challenge with the metric )( td Q is the selection of the generator matrix Q. We defined )( td Q in four different ways. The first three definitions were based on the Jukes-Cantor, F84 and Kimura-2-parameter models respectively. The fourth one is obtained first by defining the matrices P , P ~ and G as: ∑ = = k ikijij FFP , ∑ = = ~ k kijiij FFP , G = ( ) PP ~21 + ; and then taking , ijij GQ = for ji ≠ . For easy identification of each of these definitions, let us rename )( td Q in each of these cases as )( td QJ , )( td QF , )( td QK and )( td QD respectively. We conducted three types of experiments. The basic nature of each type was to select trees first, either real or random, and then try to recover the phylogenies from DNA sequence data, either simulated or actual, using neighbor-joining method together with the new as well as existing metrics. The recovered trees are then compared with the original tree, in terms of the nodal and split distances, using the software TOPD/FMTS. In the type1 experiment, using the software package Mesquite, we selected 100 random trees with number of species ranging from 25 to 124 and then with the composite simulation model, simulated DNA sequences each of length 1000 bases. We then tried to recover the original tree from the simulated sequence data applying the neighbor-joining method with the new as well as the existing metrics. Comparing the recovered trees with the original tree, the following results were obtained. The average nodal distances using the new metrics )( td QJ , )( td QF and )( td QK were 2.5776, 2.6919, 2.6991 respectively; whereas the average nodal distances using the corresponding existing metrics were 4.7693, 10.1901, 9.752 respectively. These average values shows that the new metric can mprove the existing metrics and also that the new metric performs slightly better under the Jukes-Cantor model. Figures 1-3 shows the comparison of the new metric with the corresponding existing metric in terms of nodal distance. Figure 4 shows the comparison of the metrics )( td QJ , )( td QF and )( td QK , which are various forms of the new metric under different models, in terms of the nodal distance.. The average split distance with all the three existing metrics was found to be 1, the maximum possible value. The average split distances using the new metrics )( td QJ , )( td QF and )( td QK was 0.3735, 0.4516, 0.4478 respectively, pointing to the better performance of the new metric. Figure 5 shows the comparison of the various forms of the new metric under different models in terms of the split distance. For the calculation of the new metric, here the time t is taken as 0.5. For the second type of experiments, we selected 25 real trees, with number of sequences varying between 4 and 63, from TreeFam Database. DNA sequences are then simulated for these trees using Seq-Gen. For each tree, we simulated three types of DNA sequences by multiplying the branch lengths by 1,10 and 100 respectively. In the first case, that is when we used the sequences simulated on trees with same branch lengths as obtained from TreeFam, we found our metric performing less efficiently as compared to the existing ones. Here we observed that the simulated sequences were too close to each other; so that the nondiagonal elements in each raw of the frequency matrix F were weaker compared to its diagonal elements. Concluding that this may be the reason for the poor performance of our metric, we then simulated sequences by multiplying the branch lengths of each tree with 10. Since this increased the difference between the simulated sequences and strengthened the nondiagonal elements of F, this time we anticipated a better performance of our metric over the existing metrics. Except for the Kimura-2-parameter distance, we got the expected result. That is the new metric, though narrowly, improved the Jukes-Cantor and F84 metrics. A third simulation is then carried out by multiplying the branch lengths with 100. This time also the new metric couldn’t improve the Kimura-2-parameter metric; but the improvement brought by the new metric to the Jukes-Cantor and F84 metrics became more evident. These experimental results in terms of average nodal and split distances are given in table1. Figures 6-8 shows the comparison results between the new and Jukes-Cantor metrics in terms of the nodal distance, when branch lengths are multiplied by 1,10 and 100 respectively. Figures 9-11 show the split distance comparison results between the new and Jukes-Cantor metrics, as branch lengths are multiplied by 1,10 and 100 respectively. Figures 12-17 show the above results for the new and Felsenstein 84 metrics and these results for the new and Kimura-2-parameter distance metrics are given in figures 18-23. For the calculation of the new metric, here the time t is taken as 1.5. The third experiment was based on a known tree of 11 vertebrate species; the same tree studied in Russo et al. [18]. The 11 species and their GenBank Accession numbers are given in Table 2. We tried to recover this known tree from nucleotide sequence data using the existing as well as the new metrics. Distance comparison is done as in the above two experiments. This experiment also suggested that the new metric improves the existing metrics. The results are given in Table 3. According to this table, the efficiency of the new metric in recovering correct phylogenies from real sequence data is comparatively maximum, when it is defined with the Kimura-2-parameter model. For the calculation of the new metric, here the time t is taken as 1.5. All the three type of experiments show that the new metric, when defined on the existing evolutionary models such as the Jukes-Cantor, Felsenstein 84, and Kimura-2-parameter models, recovers better phylogenetic trees from the sequence data than the existing metrics. The results also show that the new metric, defined on the fourth model, which can afford varying nucleotide substitution rates across lineages, performs equally well to the best performer among the other three new metrics. Table 1:
Average distance comparison of the new metric with the existing metrics Original branch length Branch length x 10 Branch length x 100 J-C metric 1.259 1.7648 4.0657 )( td QJ td QF td QK td QD td QJ td QF td QK td QD Table 2:
The 11 vertebrate species and their GenBank accession numbers. Species GenBank accession numbers Balaenoptera physalus X61145 Balaenoptera musculus X72204 Bos taurus V00654 Mus musculus V00711 Rattus norvegicus X14848 Didelphis virginiana Z29573 Gallus gallus X52392 Xenopus laevis M10217 Oncorhynchus mykiss L29771 Crossostoma lacustre M91245 Cyprinus carpio X61010
Table 3:
Distance comparison of the new metric with existing metrics based on a known tree of 11 vertebrate species Nodal distance Split distance J-C metric 2.5154 0.75 )( td QJ td QF td QK td QD
4. Conclusion
The distance metric )( td Q defined here has the following advantages. 1. It can be defined under any evolutionary Markov model with rate matrix Q; without any further assumption on the matrix Q. This makes it definable under very general models and therefore is widely applicable. 2. It measures the average number of base substitutions including mutations. 3. When defined under classical models such as the JC, F84 and Kimura-2-parameter models, it recovered good phylogenetic trees from simulated as well as real sequence data on a given tree. 4. The simulation experiments shows that the new metric, when defined under the JC, K2P and F84 models, improves these existing metrics as far as recovering phylogenetic trees from sequence data is concerned. . In the simulation experiments carried out, the definition of the new metric, under a model that allows varying nucleotide substitution rates among lineages, showed a second best performance among the various definitions of the new metric. References