Synonymous and Nonsynonymous Distances Help Untangle Convergent Evolution and Recombination
Peter B. Chi, Sujay Chattopadhyay, Philippe Lemey, Evgeni V. Sokurenko, Vladimir N. Minin
SSynonymous and Nonsynonymous Distances Help UntangleConvergent Evolution and Recombination
Peter B. Chi , Sujay Chattopadhyay , Philippe Lemey , Evgeni V. Sokurenko andVladimir N. Minin , ∗ Department of Statistics, California Polytechnic State University, San Luis Obispo, CA,93407, USA Department of Microbiology, University of Washington, Seattle, WA, 98195, USA Department of Microbiology and Immunology, Rega Institute, KU Leuven - University ofLeuven, B-3000 Leuven, Belgium Departments of Statistics and Biology, University of Washington, Seattle, WA, 98195,USA ∗ Address for correspondene: [email protected]
Abstract
When estimating a phylogeny from a multiple sequence alignment, researchers oftenassume the absence of recombination. However, if recombination is present, then treeestimation and all downstream analyses will be impacted, because different segmentsof the sequence alignment support different phylogenies. Similarly, convergent selectivepressures at the molecular level can also lead to phylogenetic tree incongruence acrossthe sequence alignment. Current methods for detection of phylogenetic incongruence arenot equipped to distinguish between these two different mechanisms and assume thatthe incongruence is a result of recombination or other horizontal transfer of geneticinformation. We propose a new recombination detection method that can make thisdistinction, based on synonymous codon substitution distances. Although some poweris lost by discarding the information contained in the nonsynonymous substitutions,our new method has lower false positive probabilities than the comparable recombina-tion detection method when the phylogenetic incongruence signal is due to convergentevolution. We apply our method to three empirical examples, where we analyze: 1)sequences from a transmission network of the human immunodeficiency virus, 2) tlpB gene sequences from a geographically diverse set of 38
Helicobacter pylori strains, and3) Hepatitis C virus sequences sampled longitudinally from one patient. a r X i v : . [ s t a t . M E ] O c t Introduction
The field of phylogenetics aims to describe evolutionary relationships among homologous se-quences, by inferring a phylogeny, or evolutionary tree (Felsenstein, 2004). In the estimationof a phylogenetic tree from a molecular sequence alignment, the absence of recombinationis frequently assumed, meaning that every site along the sequence alignment has the sameevolutionary history/phylogeny. Implications of recombination on tree estimation (Posadaand Crandall, 2002; Felsenstein, 2004) and downstream analyses (Schierup and Hein, 2000;Anisimova et al., 2003) have motivated the development of a plethora of tests for the pres-ence of recombination (Awadalla, 2003; Martin et al., 2011). Most of these methods try totest whether there are segments of the sequence alignment that support different phyloge-nies; if so, such phylogenetic incongruence is used as evidence of recombination (Grasslyand Holmes, 1997; McGuire et al., 1997; Posada and Crandall, 2001). However, anotherevolutionary force can produce an observed data pattern similar to the one produced byrecombination. Suppose that the same selective pressure acts upon two sequences to makethem appear more closely related to each other than they are under the true evolution-ary history. Now, if this phenomenon, known as convergent evolution (Wake et al., 2011),occurs between these two sequences only at a localized region of the alignment, then itwill appear as if this region has a different evolutionary history than the remainder of thealignment, leading to an observed phylogenetic incongruency. To our knowledge, no exist-ing method for detecting phylogenetic incongruence can distinguish between recombinationand convergent evolution. In this paper, we develop one method that can accomplish thistask.As a starting point, we consider the Dss method proposed by McGuire et al. (1997)and implemented in the TOPALi software (Milne et al., 2004). Dss, an abbreviation for“difference in the sum of squares,” is a sliding window approach that scans across thesequence alignment in question, with the assumption that if a recombination breakpointis present within any given window, then the portions of the window on opposite sides ofthe breakpoint would have distinct evolutionary trees. The test statistic produced by theDss method is based upon the pairwise distance matrices for each half of every windowacross the alignment, and an extreme value of the statistic is expected to occur in a windowthat contains a recombination event at its center. Our proposed modification is to basethe test statistic on a measure of evolutionary distance that considers only synonymoussubstitutions: the codon changes that do not result in amino acid changes. Assuming thatselection predominantly acts on the amino acid level, it follows that selection has mainlyan effect on nonsynonymous substitutions.Since synonymous substitutions provide ‘neutral’ information about evolutionary rela-tionships of sequences under study (Lemey et al., 2005; Yang, 2006; O’Brien et al., 2009),we postulate that using a distance metric which considers only synonymous substitutionswithin the Dss framework would still allow for recombination detection, but will avoidthe false positives resulting from convergent evolution. We develop a new statistic, and anovel parametric bootstrap method to access the distribution of this statistic under the nullhypothesis of no recombination.To test our new recombination detection method, we first proceed via simulations tocompare its performance to the original Dss statistic, both in terms of their ability to identifytrue recombination events, and to avoid false positives due to convergent evolution. Wealso examine three real data examples. The first is a human immunodeficiency virus (HIV)dataset, which comes from nine Belgian patients that belong to a known HIV transmission2hain (Lemey et al., 2005). This dataset has been of particular interest because phylogeneticreconstructions can be compared to the known transmission chain, providing a real dataexample in which estimation procedures can be validated. In their work, Lemey et al. (2005)studied two distinct HIV genes: pol and env , and concluded that the pol gene was underconvergent selective pressures, whereas the env gene was not. Here, we revisit this questionwith our method, by examining a concatenated alignment of the pol and env genes. Oursecond real data example is a sequence alignment of the tlpB gene encoding the methyl-accepting chemotaxis protein in
Helicobacter pylori . The importance of the TlpB proteinlies in its role as a chemoreceptor and also in colonizing the bug to the gastric mucosa of itshost (Croxen et al., 2006). Interestingly, using multiple recombination detection statistics,we found evidence of recombination in tlpB . Therefore, we choose this important gene toanalyze the distribution of recombination signals across synonymous and nonsynonymouschanges via our Dss statistics, and to determine evidence of actual recombination events.Finally, we investigate a Hepatitis C virus (HCV) sequence alignment from serum samplescollected over roughly 10 years from one chronically infected individual (Palmer et al.,2012). In their work, Palmer et al. (2012) examined sequences of the hypervariable region1 (HVR1) of the HCV genome and found evidence of recombination between two distinctviral populations residing in the individual. However, HVR1 is subject to selective pressure,as antibody responses to HCV infection target this region (Zibert et al., 1995). Thus, weanalyze this dataset with our Dss statistics to again test whether the recombination signalis due to a true recombination, or due to convergent evolution.
Let Y be a matrix that represents a DNA sequence alignment, composed of row vectors y , . . . , y n , where n is the number of taxa/species/sequences. Then, y k = ( y k , . . . , y kL ),where L is the length, or number of sites in the sequence alignment. For a given DNAsequence alignment, one common summary of the data is the distance matrix d = { d kl } ,where each element d kl is the distance between sequences k and l , for k, l ∈ (1 , . . . , n ).Intuitively, each pairwise distance simply indicates how different two sequences are fromeach other. For example, two sequences that are identical at every site along the alignmentwould have a distance of 0, under most sensible measures of distance.Typically, one assumes a substitution model that is defined by the rates of changebetween the possible states. For a pair of taxa, the evolution of each site ( y ks , y ls ) for s ∈ (1 , . . . , L ) can thus be described by a continuous-time Markov chain (CTMC) withinfinitesimal generator Λ = { λ ij } for i, j ∈ (1 , . . . , M ), where M is the number of states(e.g. for DNA nucleotide data, M = 4 as the state space is { A, C, G, T } ; for DNA codons, M = 64 as the state space is { A, C, G, T } ), and the rate of leaving state i is λ i ≡ (cid:80) Mj (cid:54) = i λ ij .With stationary distribution π ≡ ( π , . . . , π M ), we then can calculate distances for eachpair ( k, l ) as ˆ d kl = M (cid:88) i =1 ˆ π i ˆ λ i , (1)with the necessary parameter estimates being calculated from the data. As this quantityis equal to the average number of jumps in a stationary continuous-time Markov chain,3volutionary distances are thus defined as the expected number of substitutions per site,according to the given continuous-time Markov chain model.The notion of an evolutionary distance can be generalized to consider certain subsets ofsubstitutions. For example, it is sometimes of biological interest to count only transitions( A (cid:10) G and C (cid:10) T ), or transversions ( A (cid:10) T , A (cid:10) C , G (cid:10) T , and G (cid:10) C ). Avariety of ad-hoc strategies could be used to account for this (Felsenstein, 2004), but it canalso be formally incorporated into the framework of CTMC models of DNA evolution aswas demonstrated by O’Brien et al. (2009). First, we define the set L to be the subset ofthe lattice { , . . . , M } that indicates the substitutions which we wish to count; that is,( i, j ) ∈ L if i → j is a substitution of interest. Then, we can express distances for anylabeled subset of substitutions asˆ d L = M (cid:88) i =1 ˆ π i M (cid:88) j (cid:54) = i ˆ λ ij { ( i,j ) ∈L} , (2)for each pair of sequences ( k, l ). The indicator function 1 { ( i,j ) ∈L} in (2) is equal to 1 if i → j is a substitution of interest, and 0 if it is not. In this manner, the labeled distance metricdoes not count the substitutions that are not of interest. In this work, we specifically appealto the labeled subset known as synonymous substitutions: the changes in codon state whichdo not result in a change in amino acid. The Dss method for the detection of recombination is based upon evolutionary distances.First, with an estimated distance matrix, one can infer a phylogeny, which consists formallyof a topology τ and branch lengths b . The tree distance t kl ( τ , b ) between taxa k and l is then the sum of the branch lengths between them on any particular topology. Withˆ d kl estimated from DNA sequence data as above, least squares phylogenetic inference thenproceeds by finding argmin τ , b (cid:88) k,l (cid:104) ˆ d kl − t kl ( τ , b ) (cid:105) , (3)which is the usual least squares criterion. The solution (ˆ τ , ˆ b ) to (3) then gives the leastsquares phylogeny.Now, we define recombination as an exchange of genetic material between two taxa thatresults in different evolutionary histories for the different respective parts of the sequencealignment. The Dss method then uses a sliding window approach (McGuire et al., 1997;McGuire and Wright, 2000; Milne et al., 2004), as illustrated in Figure 1, where the twopanels show a window (in red) moving across a sequence alignment.First, the average of all estimated pairwise distances from the entire sequence alignmentis recorded as d . Next, the distance matrix is estimated with a DNA substitution model forthe first half of a given window, along with its mean, w { } . This distance matrix is thenstandardized by multiplying each entry by d/w { } , and the resulting standardized distancematrix for the first half of the window is recorded as ˆ d { } = { ˆ d { } kl } . Using phylogeneticleast squares as described above, we then calculateargmin τ , b (cid:88) k,l (cid:104) ˆ d { } kl − t kl ( τ , b ) (cid:105) . (4)4 axa 1 TACACACGTAGATTAGCCCCTAACAATGACCCCCGGCTGATTGCTTGTaxa 2 TACACATGTAGATTAGCCCCTAACAATGACCCCCGGCTGATTGCTTGTaxa 3 TACACATGTAGATTAGCTCCTAACAATGGCCCCCAGCTGACTGCTTGTaxa 4 TACACATGTAGATTAGCTCCTAACAATGGCCCCCAGCTGACTGCTTGTaxa 1 TACACACGTAGATTAGCCCCTAACAATGACCCCCGGCTGATTGCTTGTaxa 2 TACACATGTAGATTAGCCCCTAACAATGACCCCCGGCTGATTGCTTGTaxa 3 TACACATGTAGATTAGCTCCTAACAATGGCCCCCAGCTGACTGCTTGTaxa 4 TACACATGTAGATTAGCTCCTAACAATGGCCCCCAGCTGACTGCTTG Figure 1: Illustration of two overlapping sliding windows, shown as red boxes, across asequence alignment of four taxa. The vertical grey lines divide each window in half.The estimated topology is recorded as ˆ τ { } , and the minimized value of the sum of squaresin (4) is recorded as SSa Fw .For the second half of the window, again the distance matrix is estimated, with its meanstored as w { } and again the standardized distance matrix is calculated by multiplying eachentry by d/w { } to obtain ˆ d { } = { ˆ d { } kl } . Now, we calculate SSb Fw = min b (cid:88) k,l (cid:104) ˆ d { } kl − t kl (ˆ τ { } , b ) (cid:105) . (5)That is, the topology from the first half is imposed as fixed in (5), and only the branchlengths are optimized according to the sequence alignment of the second half of the window.For each window, we then have Dss Fw = ( SSa Fw − SSb Fw ). The entire procedure isrepeated in the reverse direction, by starting with a window at the end of the alignment,swapping the roles of each half of the window, and then sliding it backwards across thesequence alignment; this gives Dss Bw for each window. Then, Dss w = max( Dss Fw , Dss Bw ).Finally, here we consider only the maximum Dss statistic from all windows, giving us Dss max = max w ( Dss w ).Our modification of the Dss statistic uses estimates of labeled distances for synonymoussubstitutions, estimated by the method developed by O’Brien et al. (2009), replacing d and ˆ d {·} kl in the calculation of Dss max above. The original Dss method tests for discrepantphylogenies throughout windows across the sequence alignment. However, it cannot distin-guish between the case where the discrepancies are actually due to an exchange of geneticmaterial, as opposed to convergent selective pressure. In other words, with the Dss method,the null hypothesis is that the sequence alignment has one true evolutionary history thathas been affected neither by recombination nor convergent evolution, and evidence againstthe null hypothesis may be due to either recombination or convergent evolution, or due tothe presence of both of these events.With our new synonymous Dss statistic, our null hypothesis remains the same, but ouraim is that our new test will detect only departures from the null that are due to an exchangeof genetic material. Under the assumption that selective pressure acts on the amino acidlevel, synonymous substitutions are presumed to be neutral, and therefore distances basedupon them would ignore selective pressures. In this manner, potential false positive signalsfor recombination due to selection can be avoided.5 .3 Modified Parametric Bootstrap
To assess statistical significance, McGuire and Wright (2000) propose a parametric boot-strap to generate the null distribution of the test statistic. Under this parametric bootstrap,the distribution of
Dss max is simulated under the null hypothesis as follows: first, the Or-dinary Least Squares tree for the entire sequence alignment is obtained, under the chosenmodel of substitution. In this manner, the data are treated as if the sequences were inher-ited through one true tree and substitution model, in accordance with the null hypothesis.Next, sequence data are simulated under this tree, B times. Finally, the Dss values arecalculated under the same procedure as outlined above for each simulated sequence align-ment, and saving only the maximum from each simulated realization. Thus, one obtainsthe distribution of the maximum Dss statistic under the null hypothesis. This gives thebasis for determining how extreme an observed Dss statistic is, and we calculate the MonteCarlo estimate of the p-value as the proportion of simulated null Dss values that are moreextreme than our observed value in question.However, we must make important modifications to the parametric bootstrap for as-sessing statistical significance of the Dss statistic as first proposed by McGuire and Wright(2000). First, we estimate the distances between sequences on the codon scale (e.g. theexpected number of substitutions per codon site). Using this distance matrix, we thenestimate the least squares tree, which represents the null hypothesis of the evolutionaryhistory of the sequences: with no recombination or convergent evolution. Next, we estimatecodon substitution parameters from the codon model proposed by Nielsen and Yang (1998): κ (the transition/transversion ratio) and Ω = (Ω , Ω , Ω ) where each Ω i is a nonsynony-mous/synonymous rate ratio, with corresponding proportions p = ( p , p , p ) where each p i represents the probability that Ω i will be selected as the nonsynonymous/synonymousrate ratio for any given site. This mixture of three codon models allows for estimationof variable nonsynonymous/synonymous rate ratios at each site, to simulate the bootstrapdata as similarly as possible to the evolutionary process that created the original data.With the null evolutionary history, κ , Ω and p estimated from the sequence data, we thensimulate our parametric bootstrap sequence alignment datasets. Using these, we calculatethe original Dss statistic and the synonymous Dss statistic in the manner described above,to then obtain the distribution of the maximum, for each. All analyses have been performed using the R package synDss , in which we implementedthe proposed methodology. The package contains our implementation of the original Dssmethod, our synonymous Dss method, and modified parametric bootstrap. The source codeand binaries are available at http://evolmod.r-forge.r-project.org/ . To assess performance of each statistic, we simulate sequences under a codon model usingthe software package PAML (Yang, 2007). Three basic scenarios are considered: 1) null; 2)true recombination event; 3) localized convergent evolution. For each scenario, we considera sequence alignment with five taxa, and we set κ = 2 (transition/transversion ratio), and6ample Ω (nonsynonymous/synonymous rate ratio) from a discrete mixture model with val-ues Ω = (0 . , . , . p = (0 . , . , . p = (0 . , . , .
01) and p = (0 . , . , . .
80 and 0 .
67 relative to the original branch lengths, to result inscenarios of “high diversity,” “medium diversity,” and “low diversity,” respectively.With α = 0 .
05, Type I error probabilities under 1000 replicates of a simulated nullscenario appear to be well-behaved, with estimated Type I error probabilities of 6 .
6% and6 .
3% for the original Dss and synonymous Dss tests, respectively. Distributions of thep-values resemble a uniform distribution, as shown in Supplementary Figure S1.Next, we examine power to detect recombination, under the scenario with a true recom-bination event. We vary the expected number of substitutions (diversity) and proportionof synonymous substitutions, and examine the corresponding effect on the power of eachversion of the test statistic. Histograms of p-values from the original Dss statistic and syn-onymous Dss statistic from one scenario are shown in Supplementary Figure S2, where weobserve 90% power with the original Dss test statistic, and 76% power with our synonymousDss test statistic. The results from all scenarios are shown in Table 1. Our synonymous Dssstatistic has reduced power in every case, which is to be expected since we have reducedthe amount of information used. The reduction in power is less dramatic in the scenarioswhere a greater proportion of the substitutions are synonymous (bottom row of Table 1),since less information is being discarded in these cases.7 t2 t4 t3t5t1 B t5 t4 t3t2t1 Figure 2: Phylogenies used for simulations. Numbers indicate branch lengths, in expectednumber of substitutions per site between two nodes.Table 1: Power of each test under the recombination scenario. Each column represents oneset of branch lengths (equivalently, the diversity), which correspond to each average powerof the original Dss test. Each cell represents the power of the synonymous Dss statistic,with 95% confidence intervals in parentheses. In each scenario, 100 simulated replicationswere analyzed. Power of original Dss99% 90% 85%50% syn 66 (56.6, 75.4) 38 (28.4, 47.6) 20 (12.1, 27.9)60% syn 79 (70.9, 87.1) 48 (38.1, 57.9) 34 (24.6, 43.4)75% syn 87 (80.3, 93.7) 76 (67.5, 84.5) 62 (52.4, 71.6)Under the scenario of convergent evolution, we compare the false positive probabili-ties under the original Dss statistic and the synonymous Dss statistic. That is, while theDss statistic detects phylogenetic incongruence from any cause, we want to determine ifthe synonymous Dss statistic can avoid giving a significant p-value when the phylogeneticincongruence is due to convergent evolution. Under every scenario, we observe that thefalse positive probability of the synonymous Dss method is substantially lower than thatof the original Dss statistic, as shown in Figure 3. For example, under high diversity and50% synonymous substitutions, the estimated false positive probability for the original Dssstatistic is 33%, vs. 9% for the synonymous Dss statistic. Results from all scenarios areshown in Figure 3, labeled as “Orig” and “Syn” respectively.We next examine whether this reduction in false positive probability might simply be dueto the fact that the synonymous Dss statistic uses less information; that is, it considers onlysynonymous substitutions. To answer this question, we first examine the effect of removinga proportion of sites, corresponding to the proportion of synonymous substitutions. For8xample, under the scenario with 75% synonymous substitutions, we retain 75% of thealignment sites at random, and then obtain the original Dss statistic. We observe that thefalse positive rate under these simulations are similar to that of the original Dss statistic,as shown in Figure 3, labeled as “Del 1.”However, this effort suffers from the fact that, while the sequence alignments are shorter,our window size has remained the same, thus resulting in fewer windows across the align-ments. In our exploration of the Dss statistic behavior, we have noticed trends betweenwindow count and Power / Type I error (not shown) indicating that the “Del 1” regime isprobably anti-conservative. Thus, we perform another validation experiment in which wealso shrink the window size by the corresponding proportion; that is, if we removed 50%of the sites, we also shrink the window size by 50%. This is shown in Figure 3 as “Del 2.”Based on our experimentation with the relationship between window size and power (notshown), we believe this to be a conservative effort, and yet in all nine scenarios, we stillobtain higher Type I error probabilities under this regime than that of the synonymous Dssstatistic.Finally, for the scenarios with 50% synonymous substitutions, we perform one addi-tional set of experiments. Noting that in these scenarios, a nonsynonymous Dss statisticwould have, on average, the same loss of information as our synonymous Dss statistic, wethus create a nonsynonymous Dss statistic in an analogous manner to which we created oursynonymous Dss statistic, using labeled distances for nonsynonymous substitutions. Wethen run this nonsynonymous Dss statistic on the same set of data for each of the 50%synonymous substitution scenarios. In the high diversity case, we obtain a false positiveprobability of 19%, which is substantially higher than the synonymous Dss statistic’s falsepositive probability of 7%. For medium and low diversity, we obtain false positive proba-bilities of 13% and 7% respectively, which are identical to their respective estimated falsepositive probabilities from the synonymous Dss statistic.For our simulation studies, we set the number of bootstrap replicates to B = 100. Weare able to use B = 500 for the real data analyses, but it would have been prohibitively timeconsuming to do this for the simulations. Although the resulting accuracy of significancelevel thresholds may thus be of some concern, we found through a brief examination thatthe value of the 95% significance level threshold does not move substantially with B = 100on different parametric bootstrap runs with the same original datasets. Phylogenetic analyses of HIV sequences are useful in characterizing its transmission andspread, and these analyses are particularly relevant to elucidating the development of HIVdrug resistance (Lemey et al., 2005). However, while the typically high mutation ratesand short generation times for HIV are conducive towards a phylogenetic reconstruction,phylogenetic inference can be confounded by the high recombination rates, and selectivepressures imposed by antiretroviral therapy and the host immune system (Rambaut et al.,2004). Our method is the first to address the important issue of distinguishing betweenrecombination and convergent evolution, and thus we apply it here.Of particular interest are the pol and env genes of HIV-1, which are responsible forreplication (Hill et al., 2005) and cell entry (Coffin et al., 1997), respectively. These twogenes were studied through a transmission chain of nine Belgian HIV-positive patients(Lemey et al., 2005), in which it was found that a phylogenetic reconstruction using thesequenced env gene was compatible with the known transmission history among these nine9 l l l l l l l l l l l . . . . F a l s e P o s i t i v e P r obab ili t y % sy non y m ou s s ub s t i t u t i on s high diversity medium diversity low diversity l l l l l l l l l l l l . . . . F a l s e P o s i t i v e P r obab ili t y % sy non y m ou s s ub s t i t u t i on s l l l l l l l l l l l l . . . . Orig Del 1 Del 2 Syn Orig Del 1 Del 2 Syn Orig Del 1 Del 2 Syn F a l s e P o s i t i v e P r obab ili t y % sy non y m ou s s ub s t i t u t i on s Figure 3: False positive probability of each test under the convergent evolution scenario.Using the same branch length sets (diversity) and synonymous substitution proportions asin the recombination scenarios, we induce convergent evolution on the alignment instead ofa true recombination. “Orig” refers to the original Dss statistic; “Del 1” refers to the casein which we remove a proportion of substitutions corresponding to the non-synonymoussubstitution proportion (and thus keeping a proportion corresponding to the synonymoussubstitution proportion); “Del 2” is similar to “Del 1” except that we also shrink the windowsize by the corresponding proportion; “Syn” refers to the synonymous Dss statistic. Errorbars represent 95% confidence intervals based on the asymptotic binomial variance, usingthe observed false positive probability as ˆ p to obtain standard errors. In each scenario, 100simulated replications were analyzed. 10atients; on the other hand, the phylogenetic reconstruction using the pol gene sequenceswas not compatible with the transmission history. This raised the question of whetherselective pressures might be the cause of this incongruity.Specifically, Lemey et al. (2005) explored whether the selective pressure may have beendue to antiretroviral drug therapies applied to HIV-positive patients in the transmissionchain. They hypothesized that patients on similar antiretroviral drug treatments may invokeconvergent evolution on their HIV strains, due to the fact that their respective HIV virusesmay develop the same drug resistance-associated mutations. By examinating known drugresistance-associated mutations within the pol gene, they found this was in fact the casewith two of their individuals: “Patient A” and “Patient I.” That is, these two individualsshared specific amino acid substitutions that have been identified by the International AIDSSociety as being associated with clinical resistance to HIV antiretroviral drugs (Johnsonet al., 2003).Following this observation, Lemey et al. (2005) then constructed phylogenetic trees forthe pol gene based on synonymous distances and nonsynonymous distances separately, usingthe Syn-SCAN software (Gonzales et al., 2002). The synonymous tree was compatible withthe transmission history, while the nonsynonymous tree was not, and showed Patient A’sstrains clustering with those of Patient I. For an illustration, see Figure 4 from the workby Lemey et al. (2005). Thus, they concluded that the pol gene was under convergentselective pressure. Here, we revisit this question by examining the behavior of each Dssstatistic on a concatenated pol-env sequence alignment. That is, if we join the two sequencealignments together as one, will either recombination detection method indicate the presenceof intergenic or intragenic recombination?The dataset consists of nine individuals, with multiple samples taken longitudinally fromsome of them, for a total of 13 sequences. Results from our analyses are shown in Figure4. Our analysis used a window size of 636 nucleotides (or 212 codons) and a step size of 9nucleotides (or 3 codons). To assess statistical significance, we used a parametric bootstrapwith the number of replications set to B = 500. We observe that both the original Dssstatistic and the synonymous Dss statistic cross their 95% bootstrap significance thresholds,suggesting the presence of a recombination event, as opposed to convergent evolution. H. pylori tlpB gene
One of the most common diseases in the world is chronic gastritis. The human-adaptedmotile Gram-negative bacteria
Helicobacter pylori is the major causative agent of chronicgastritis, in addition to causing stomach and duodenal ulcers and gastric cancer, therebyinfecting about half of worlds populations (Feldman et al., 1998). Infection by
H. pylori is typically acquired by ingestion, with person-to-person transmission occurring most com-monly through vomit, saliva or feces (Feldman, 2001; Parsonnet et al., 1999). Due to theemergence of antibiotic-resistant strains, treatment of
H. pylori has begun to fail in roughly20–30% of cases (Graham, 2009), which points to the need for a better understanding ofthe evolutionary processes that drive
H. pylori diversity and survival.Here, we focus on tlpB , the gene that encodes the TlpB methyl-accepting chemotaxisprotein. This protein is crucial to
H. pylori ’s ability to colonize the stomach of its host, as itis responsible for its pH taxis, or movement in response to high acidity (Croxen et al., 2006).TlpB allows the bacterium to sense the pH of its surroundings, and move toward an optimalpH zone. Due to these functional roles, TlpB is a potential target of the immune response bythe infected host. Thus, recombination is a potential diversification mechanism that could11 . . . . . O r i g i na l D ss pol env
500 1000 1500 . . . . . S y non y m ou s D ss Nucleotide Position (Center of Window)
Figure 4:
Dss statistic landscapes for pol-env concatenation . Dotted horizonal linesrepresent the 95% significance level for each test, from a parametric bootstrap with B = 500.The red vertical lines represent the boundary between the two genes, with pol on the left,and env on the right.be used by TlpB to avoid elimination by the host’s immune response. To investigate this,we selected tlpB gene sequences from 38 completely sequenced H. pylori genomes of globallyrepresentative isolates. Recombination detection analyses by PhiPack (Bruen et al., 2006)involving three different statistics – Pairwise Homoplasy Index (p < χ (p=0.005) and Neighbor Similarity Score (p < H. pylori that shows extensive genomic diversity, itcould be a common scenario that an excess of convergent mutations created an illusion ofrecombination, thereby confusing traditional recombination detection algorithms. With thesame tlpB sequence alignment dataset from 38 isolates, our aim is to determine whether ourDss analysis concludes that there is truly a recombination signal, or whether it was actuallydue to convergent evolution.If recombination has indeed occurred, then we would expect that both the originalDss statistic and the synonymous Dss statistic would find a statistically significant signalfor recombination. In contrast, if the recombination signal is actually due to convergentevolution, then we would expect that the original Dss test and the nonsynonymous Dss testwould find a statistically significant signal for recombination, whereas our synonymous Dss12 . . . O r i g i na l D ss . . . S y non y m ou s D ss
400 600 800 1000 1200 1400 . . . . N on sy non y m ou s D ss Nucleotide Position (Window Center)
Figure 5: Dss statistic landscape for
H. pylori tlpB gene. Dotted horizonal line representsthe 95% significance level for each test, from a parametric bootstrap with B = 500.test would not, as in the HIV example.Here, we perform analyses with a window size of 600 nucleotides (200 codons), and stepsize of 9 nucleotides (3 codons). The sequence alignment was 1695 nucleotides (565 codons),yielding approximately 122 windows across the alignment. To obtain significance thresholdlevels, we use the parametric bootstrap with the number of replications set at B = 500.Results from our analyses are shown in Figure 5. We observe that neither the original Dssstatistic nor the synonymous Dss statistic crosses its respective 95% bootstrap significancethreshold. However, the original Dss statistic was very close, and in fact gave a bootstrapp-value of 0.058. In contrast, the synonymous Dss statistic did not come quite as close toits 95% bootstrap significance threshold, and gave a bootstrap p-value of 0.14.It might be a possibility that the lack of signal with the synonymous Dss statistic wasdue to a loss of power. However, we found that the nonsynonymous Dss statistic did cross its95% bootstrap significance threshold, with a bootstrap p-value of 0. Also, an examination ofthe sequence alignment revealed that there was approximately a 1.5:1 ratio of synonymoussubstitutions to nonsynonymous substitutions. Therefore, any loss of power observed withthe synonymous Dss statistic should also be observed with the nonsynonymous Dss statistic.More importantly, these results demonstrated that the recombination signal was drivenprimarily by nonsynonymous substitutions. Absence of any such signal with the synonymousDss statistic strongly suggests that the recombination signal in the nonsynonymous changeswas actually due to convergent evolution, most likely in response to adaptive selectionpressures. 13 .4 Data Analysis III: HCV HCV is an RNA virus that is estimated to infect roughly 3% of the human populationworldwide, and is a leading cause of liver disease and liver cancer (WHO, 2003). Overall,treatment success has been limited, and thus it has been recognized that a greater under-standing of the virus’ evolutionary behavior is crucial to effective prevention and treatmentof HCV infection (Gray et al., 2012). Indeed, specifically the matter of whether geneticrecombination occurs in HCV has important implications regarding resistance developmentagainst antiretroviral treatments used against the diseases that are caused by HCV infection(Morel et al., 2011).Similar to HIV, HCV mutates very rapidly within an infected host, which makes treat-ment difficult, but also should reveal patterns that will lead to a greater understandingof the link between the evolution of the virus and progression of disease (Okamoto et al.,1992; Smith et al., 1997). Curiously, however, there have been few reports of recombinationoccuring in HCV, despite the fact that recombination can be an important diversificationmechanism in positive sense RNA viruses (Gonz´alez-Candelas et al., 2011). One potentialreason is that simultaneous infection by two or more HCV types might be rare (Viazovet al., 2000; Tscherne et al., 2007). Additionally, it has been postulated that the viabilityof recombinant strains is poor in vivo (Prescott et al., 1997).Here, we investigate a sequence alignment of the hypervariable region 1 (HVR1) of HCV,from serial samples taken over 9.6 years from a single infected patient (Palmer et al., 2012).In the initial study, Palmer et al. (2012) found evidence of recombination between twoHVR1 subpopulations within the patient. However, this genetic region is subject to strongselective pressures as the envelope glycoprotein is targeted by the host antibody responses(von Hahn et al., 2007). Thus, we analyze this sequence alignment with our Dss statistics,to determine whether our analysis corroborates the original findings of Palmer et al. (2012),or whether this recombination signal is confounded by convergent evolution.We perform analyses with a window size of 144 nucleotides (48 codons) with a step sizeof 6 nucleotides (2 codons). The sequence alignment was 324 nucleotides long (108 codons),which gives 31 windows across the alignment. Results from our analyses are shown inFigure 6, with significance threshold levels obtained from the parametric bootstrap with B = 500. We observe that the original Dss statistic crosses its 95% bootstrap significancethreshold, whereas the synonymous Dss statistic does not. Additionally, the nonsynonymousDss statistic crosses its 95% bootstrap significance threshold, suggesting that the lack ofsignal with the synonymous Dss statistic was not simply due to a loss of power. This isfurther supported by an examination of the raw counts of synonymous vs. nonsynonymoussubstitutions in this sequence alignment, as there are in fact slightly more synonymoussubstitutions than nonsynonymous substitutions. Thus, we find statistically significantevidence that the recombination signal in HVR1 sequences is actually due to convergentevolution. In this work, we have introduced the synonymous Dss statistic, developed to give a statisticalmethod which allows us to distinguish between recombination and convergent evolution.Our simulations show that while our synonymous Dss statistic loses some power comparedto the original Dss statistic, it does have a lower false positive probability when the signalis due to convergent evolution. Furthermore, we provide some verification that this lower14 O r i g i na l D ss S y non y m ou s D ss
100 150 200 250 N on sy non y m ou s D ss Nucleotide Position (Window Center)
Figure 6: Dss statistic landscapes for HVR1 of HCV alignment. Dotted horizonal linesrepresent the 95% significance level for each test, from a parametric bootstrap with B = 500.false positive probability is not simply due to the loss of power, as suggested by the falsepositive probabilities of the various scenarios in which we remove a comparable portion ofthe information in the sequences, shown in Figure 3.Our real data analyses highlight the usefulness of our methodology when dealing withsituations where both recombination and convergent evolutionary may be participating inthe evolution of molecular sequences. We find evidence contrary to the conclusion by Lemeyet al. (2005) that convergent evolution has occurred in the pol gene lineage of HIV; instead,we find evidence for the occurrence of a recombination event. The benefit of using ourmethod is that we have created a statistical testing framework for addressing precisely thisquestion. In contrast, Lemey et al. (2005) examined phylogenies constructed with syn-onymous and nonsynonymous substitutions separately, basing their conclusion on whethereach phylogeny matched the known transmission chain. However, Lemey et al. (2005) haddifficulty providing a measure of statistical significance for their findings; in contrast, ourmethod naturally assigns statistical significance to our individual findings. Furthermore,Lemey et al. (2005) implicitly assumed that the entire pol gene had one evolutionary his-tory, and likewise for the env gene. If recombination had occurred within either gene, thenthis assumption would be violated. Similarly, Lemey et al. (2005) compared the synony-mous and nonsynonymous trees of the entire pol gene, while convergent evolution is mostlikely to be localized to a subset of sites, making it difficult to detect using distances basedon the whole alignment. In contrast, our sliding window method is able to separate thecontributions of synonymous and nonsynonymous substitutions to the local phylogenetic15ncongruence signal, which we believe to be an important advantage.Croxen et al. (2006) found that tlpB mutant strains were deficient in colonization due totheir inability to respond to the pH gradient. More recently, engineered mutational analysisshowed the disruption in urea-binding and thermal stabilization of the mutational variants(Sweeney et al., 2012). Also, the work demonstrated reduced chemotactic responses ofurea-binding variants to acid. These experimental results suggested the possibility thatthe natural mutational variations in the TlpB protein could arise from adaptive selectionpressures. While the gene showed recombination signals via traditional statistics, our novelapproach detected the signal to be due to the presence of convergent nonsynonymous (i.e.amino acid replacement) mutations. Such events of repeated, independent (i.e. phyloge-netically unlinked) accumulation of mutations at specific amino acid positions of encodedproteins represents powerful evidence of adaptive events (Christin et al., 2012; Tenaillonet al., 2012). Taken together, our results, on one hand, indicated the presence of adaptiveevolution of the H. pylori tlpB gene via convergent nonsynonymous mutations. On theother hand, this study depicted the promise of our approach to differentiate convergentmutational events from recombination.As noted by Gonz´alez-Candelas et al. (2011), there has been some debate regarding theoccurrence of recombination as a diversification mechanism in HCV. The reports of in vivo recombination have been questioned as being due to either PCR artifacts or misidentificationdue to convergent evolution. Here, we find evidence that the recombination signal found byPalmer et al. (2012) in their HVR1 sequence alignment is due to convergent evolution. Asthe occurrence of recombination in HCV continues to be called into question, our resultsside with the notion that empirical evidence of recombination of HCV sequences shouldbe interpreted with caution, because of a possibility of false positives due to convergentevolution.A fundamental question that one might ask is how our method is advantageous oversimply removing sites that contain nonsynonymous substitutions during a recombinationdetection analysis. An illustration of the answer can be observed by considering an align-ment containing a large number of sequences, in which multiple substitutions per site wouldnot be uncommon. Thus, if two substitutions had occurred at a particular site, then onesubstitutions could be synonymous and the other could be nonsynonymous. To use thebrute-force approach of removing sites that contain nonsynonymous substitutions wouldnecessarily remove the information contained in the synonymous substitution that had oc-curred at that site; that is, to remove the site means to remove the entire column from thesequence alignment, so all of the information contained in that site is lost. In contrast, ourapproach of counting synonymous substitutions under the framework laid out by O’Brienet al. (2009) removes the nonsynonymous mutation information in a more elegant manner,avoiding the total loss of information that would result from removing entire sites.A potential future development would be to create a coherent method to disintanglerecombination and convergent evolution without a convoluted three-way comparison, be-tween the original Dss statistic, the synonymous Dss statistic, and the nonsynoymous Dssstatistic. That is, in this study, we would conclude that there is evidence for recombinationif both the original Dss statistic and the synonymous Dss statistic show a positive signal.If the original Dss statistic shows a positive signal but the synonymous Dss statistic doesnot, then we would conclude that this is evidence of convergent evolution, further validatedif the nonsynonymous Dss statistic also showed a positive signal. It would be preferableif a methodology could produce one coherent statistic to evaluate in order to answer thisquestion, instead of two or three. 16inally, there is the potential that our concept could be implemented in other recombi-nation detection regimes, specifically those that are likelihood-based. It is well documentedthat sliding window recombination detection methodology, such as that of the Dss statistic,has drawbacks. For example, the behavior of the test statistic is somewhat influenced bythe window size chosen, and there are few guidelines on how to select this tuning parameter(McGuire and Wright, 2000). Also, a multiple comparisons issue exists, since each windowproduces a value of the test statistic. Although this issue is handled by considering onlythe maximum statistic value from the alignment and performing an appropriate paramet-ric bootstrap test for statistical significance, this strategy prevents estimating locations ofrecombination break-points with confidence. Thus, it may be advantageous to import ourconcept of synonymous recombination detection into a likelihood-based framework, such asthose proposed in (Husmeier and Wright, 2003), or in (Minin et al., 2005; Suchard et al.,2003).
Acknowledgements
We thank Adam Leach´e and Ken Rice for helpful comments and discussions. VNM wassupported by the National Science Foundation grant DMS-0856099 and the National Insti-tutes of Health grant R01-AI107034. VNM and EVS were supported by the NIH ARRAaward 1RC4AI092828. PL acknowledges funding from the European Research Council un-der the European Community’s Seventh Framework Programme (FP7/2007-2013) underERC Grant agreement no. 260864.
References
M. Anisimova, R. Nielsen, and Z. Yang. Effect of recombination on the accuracy of thelikelihood method for detecting positive selection at amino acid sites. Genetics, 164:1229–1236, 2003.P. Awadalla. The evolutionary genomics of pathogen recombination. Nature Reviews, 4:50–60, 2003.T.C. Bruen, H. Philippe, and D. Bryant. A simple robust statistical test for detecting thepresence of recombination. Genetics, 172:2665–2681, 2006.P.A. Christin, G. Besnard, E.J. Edwards, and N. Salamin. Effect of genetic convergence onphylogenetic inference. Molecular Phylogenetics and Evolution, 62:921–927, 2012.J.M. Coffin, S.H. Hughes, and H.E. Vamus. Retroviruses. Cold Spring Harbor LaboratoryPress, 1997.M.A. Croxen, G. Sisson, R. Melano, and P.S. Hoffman. The
Helicobacter pylori chemotaxisreceptor tlpB (HP0103) is required for pH taxis and for colonization of the gastric mucosa.Journal of Bacteriology, 188:2656–2665, 2006.R.A. Feldman. Epidemiologic observations and open questions about disease and infectioncaused by
Helicobacter pylori . Horizon Scientific Press, 2001.17.A. Feldman, A.J.P Eccersley, and J.M Hardie. Epidemiology of
Helicobacter pylori :acquisition, transmission, population prevalence and disease-to-infection ratio. BritishMedical Bulletin, 54:39–53, 1998.J.F. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2004.M.J. Gonzales, J.M. Dugan, and R.W. Shafer. Synonymous-non-synonymous mutationrates between sequences containing ambiguous nucleotides (Syn-SCAN). Bioinformatics,18:886–887, 2002.F. Gonz´alez-Candelas, F.X. L´opez-Labrador, and M.A. Bracho. Recombination in hepatitisC virus. Viruses, 3:2006–2024, 2011.D.Y. Graham. Efficient identification and evaluation of effective
Helicobacter pylori thera-pies. Clinical Gasteroenterology and Hepatology, 7:145–148, 2009.N.C. Grassly and E.C. Holmes. A likelihood method for the detection of selection andrecombination using nucleotide sequences. Molecular Biology and Evolution, 14:239–247,1997.R.R. Gray, M. Salemi, P. Klenerman, and O.G. Pybus. A new evolutionary model forHepatitis C virus chronic infection. PLoS Pathogens, 8:e1002656, 2012.M. Hill, G. Tachedjian, and J. Mak. The packaging and maturation of the HIV-1 Polproteins. Current HIV Research, 3:73–85, 2005.D. Husmeier and F. Wright. Detecting recombination in 4-taxa DNA sequence alignmentswith Bayesian hidden Markov models and Markov chain Monte Carlo. Molecular Biologyand Evolution, 20:315–337, 2003.V.A. Johnson, F. Brun-Vezinet, B. Clotet, B. Conway, R.T. D’Aquila, L.M. Demeter, D.R.Kuritzkes, D. Pillay, J.M. Schapiro, A. Telenti, and D.D. Richman. Drug resistancemutations in HIV-1. Topics in HIV Medicine, 11:215–221, 2003.P. Lemey, I. Derdelinckx, A. Rambaut, K. Van Laethem, S. Dumont, S. Vermeulen, andE. Van Wijngaerden. Molecular footprint of drug-selective pressure in a human immun-odeficiency virus transmission chain. Journal of Virology, 79:11981–11989, 2005.D.P. Martin, P. Lemey, and D. Posada. Analysing recombination in nucleotide sequences.Molecular Ecology Resources, 11:943–955, 2011.G. McGuire and F. Wright. TOPAL 2.0: improved detection of mosaic sequences withinmuliple alignments. Bioinformatics, 16:130–134, 2000.G. McGuire, F. Wright, and M.J. Prentice. A graphical method for detecting recombinationin phylogenetic data sets. Molecular Biology and Evolution, 14:1125–1131, 1997.I. Milne, F. Wright, G. Rowe, D.F. Marshall, D. Husmeier, and G. McGuire. TOPALi: soft-ware for automatic identification of recombinant sequences within DNA multiple align-ments. Bioinformatics, 20:1806–1807, 2004.V.N. Minin, K.S. Dorman, F. Fang, and M.A. Suchard. Dual multiple change-point modelleads to more accurate recombination detection. Bioinformatics, 21:3034–3042, 2005.18. Morel, C. Fournier, C. Fran¸cois, E. Brochot, F. Helle, G. Duverlie, and S. Castelain.Genetic recombination of the hepatitis C virus: clinical implications. Journal of ViralHepatitis, 18:77–83, 2011.R. Nielsen and Z. Yang. Likelihood models for detecting positively selected amino acid sitesand applications to the HIV-1 envelope gene. Genetics, 148:929–936, 1998.J.D. O’Brien, V.N. Minin, and M.A. Suchard. Learning to count: robust estimates forlabeled distances between molecular sequences. Molecular Biology and Evolution, 26:801–814, 2009.H. Okamoto, K. Kurai, S.I. Okada, K. Yamamoto, H. Iizuka, T. Tanaka, S. Fukuda, andF. Tsuda. Full length sequence of hepatitis C virus genome having poor homology toreported isolates: Comparative study of four distinct genotypes. Journal of GeneralVirology, 188:331–341, 1992.B.A. Palmer, I. Moreau, J. Levis, C. Harty, O. Crosbie, E. Kenny-Walsh, and L.J. Fanning.Insertion and recombination events at hypervariable region 1 over 9.6 years of hepatitisc virus chronic infection. Journal of General Virology, 93:2614–2624, 2012.J. Parsonnet, H. Shmuely, and T. Haggerty. Fecal and oral shedding of
Helicobacter pylori from healthy infected adults. The Journal of the American Medical Association, 282:2240–2245, 1999.D. Posada and K.A. Crandall. Evaluation of methods for detecting recombination from DNAsequences: computer simulations. Proceedings of the National Academy of Sciences, 98:13757–13762, 2001.D. Posada and K.A. Crandall. The effect of recombination on the accuracy of phylogeneticestimation. Journal of Molecular Evolution, 54:396–402, 2002.L.E. Prescott, A. Berger, J.M. Pawlotsky, P. Conjeevaram, I. Pike, and P. Simmonds.Sequence analysis of hepatitis C virus variants producing discrepant results with twodifferent genotyping assays. Journal of Medical Virology, 53:237–244, 1997.A. Rambaut, D. Posada, K.A. Crandall, and E.C. Holmes. The causes and consequences ofHIV evolution. Nature Reviews. Genetics, 5:52–61, 2004.M.H. Schierup and J. Hein. Consequences of recombination on traditional phylogeneticanalysis. Genetics, 156:879–891, 2000.D.B. Smith, S. Pathirana, F. Davidson, E. Lawlor, J Power, P.L. Yap, and P. Simmonds.The origin of hepatitis C virus genotypes. Journal of General Virology, 78:321–328, 1997.M.A. Suchard, R.E. Weiss, K.S. Dorman, and J.S. Sinsheimer. Inferring spatial phylogeneticvariation along nucleotide sequences: a multiple changepoint model. The Journal of theAmerican Statistical Association, 98:427–437, 2003.E. Goers Sweeney, J.N. Henderson, J. Goers, C. Wreden, K.G. Hicks, J.K. Foster,R. Parthasarathy, S.J. Remington, and K. Guillemin. Structure and proposed mecha-nism for the pH-sensing
Helicobacter pylori chemoreceptor tlpB. Structure, 20:1177–1188,2012. 19. Tenaillon, A. Rodriguez-Verdugo, R.L. Gaut, P. McDonald, A.F. Bennett, A.D. Long,and B.S. Gaut. The molecular diversity of adaptive convergence. Science, 335:457–461,2012.D.M. Tscherne, M.J. Evans, T. von Hahn, C.T. Jones, Z. Stamataki, J.A. McKeating, B.D.Lindenbach, and C.M. Rice. Superinfection exclusion in cells infected with hepatitis Cvirus. Journal of Virology, 81:3693–3703, 2007.S. Viazov, A. Widell, and E. Nordenfelt. Mixed infection with two types of hepatitis Cvirus is probably a rare event. Infection, 28:21–25, 2000.T. von Hahn, J.C. Yoon, H. Alter, C.M. Rice, B. Rehermann, P. Balfe, and J.A. McKeating.Hepatitis C virus continuously escapes from neutralizing antibody and T-cell responsesduring chronic infection in vivo. Gasterenterology, 132:667–678, 2007.D.B. Wake, M.H. Wake, and C.D. Specht. Homoplasy: from detecting pattern to determin-ing process and mechanism of evolution. Science, 331:1032–1035, 2011.WHO. WHO HCV: surveillance and control. , 2003. Accessed: 2014-08-11.Z. Yang. Computational Molecular Evolution. Oxford University Press, 2006.Z. Yang. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology andEvolution, 24:1586–1591, 2007.Z. Zibert, E. Schreier, and M. Roggendorf. Antibodies in human sera specific to hyper-variable region 1 of hepatitis C virus can block viral attachment. Virology, 208:653–661,1995. 20 upplementary Figures
Original Dss P e r c en t i n ea c h b i n p−value Synonymous Dss p−value
Figure S-1: Distribution of p-values of the original and synonymous Dss tests when the dataare simulated under the null hypothesis of no recombination and no convergent evolution.
Original Dss F r equen cy p−value Synonymous Dss p−valuep−value