Viral population estimation using pyrosequencing
Nicholas Eriksson, Lior Pachter, Yumi Mitsuya, Soo-Yon Rhee, Chunlin Wang, Baback Gharizadeh, Mostafa Ronaghi, Robert W. Shafer, Niko Beerenwinkel
VViral population estimation using pyrosequencing
Nicholas Eriksson ,(cid:63) , Lior Pachter , Yumi Mitsuya , Soo-Yon Rhee , Chunlin Wang ,Baback Gharizadeh , Mostafa Ronaghi , Robert W. Shafer , and Niko Beerenwinkel Department of Statistics, University of Chicago Departments of Mathematics and Computer Science, UC Berkeley Division of Infectious Diseases, Department of Medicine, Stanford University Genome Technology Center, Stanford University Department of Biosystems Science and Engineering, ETH Z¨urich
Abstract.
The diversity of virus populations within single infected hosts presents a ma-jor difficulty for the natural immune response as well as for vaccine design and antiviraldrug therapy. Recently developed pyrophosphate based sequencing technologies (pyrose-quencing) can be used for quantifying this diversity by ultra-deep sequencing of virussamples. We present computational methods for the analysis of such sequence data andapply these techniques to pyrosequencing data obtained from HIV populations withinpatients harboring drug resistant virus strains. Our main result is the estimation of thepopulation structure of the sample from the pyrosequencing reads. This inference is basedon a statistical approach to error correction, followed by a combinatorial algorithm forconstructing a minimal set of haplotypes that explain the data. Using this set of explain-ing haplotypes, we apply a statistical model to infer the frequencies of the haplotypesin the population via an EM algorithm. We demonstrate that pyrosequencing reads al-low for effective population reconstruction by extensive simulations and by comparisonto 165 sequences obtained directly from clonal sequencing of four independent, diverseHIV populations. Thus, pyrosequencing can be used for cost-effective estimation of thestructure of virus populations, promising new insights into viral evolutionary dynamicsand disease control strategies.
Synopsis
The genetic diversity of viral populations is important for biomedical problems suchas disease progression, vaccine design, and drug resistance, yet it is not generally well-understood. In this paper, we use pyrosequencing, a novel DNA sequencing technique,to reconstruct viral populations. Pyrosequencing produces a large number of short,error-prone DNA reads. We develop mathematical and statistical tools to correct errorsand assemble the reads into the different viral strains present in the population. Weapply these methods to HIV-1 populations from drug resistant patients and show thatpyrosequencing produces results quite close to accepted techniques at a low cost andpotentially higher resolution. (cid:63)
Corresponding author: [email protected] a r X i v : . [ q - b i o . P E ] J a n EQUENCE READS alignmenterror correctionhaplotype reconstructionhaplotype frequency estimation
VIRUS POPULATION
Fig. 1.
Overview of viral population estimation using pyrosequencing. Sequence readsare first aligned to a reference strain, then corrected for errors, and assembled intohaplotype candidates. Finally, the relative frequencies of the reconstructed haplotypesare estimated in a ML fashion. These estimates constitute the inferred virus population.
Pyrosequencing is a novel experimental technique for determining the sequence of DNAbases in a genome [1,2]. The method is faster, less laborious, and cheaper than existingtechnologies, but pyrosequencing reads are also significantly shorter and more error-prone (about 100–250 base pairs and 5-10 errors/kb) than those obtained from Sangersequencing (about 1000 base pairs and 0 .
01 errors/kb) [3,4,5].In this paper we address computational issues that arise in applying this technologyto the sequencing of an RNA virus sample. Within-host RNA virus populations consistof different haplotypes (or strains) that are evolutionarily related. The population canexhibit a high degree of genetic diversity and is often referred to as a quasispecies, a con-cept that originally described a mutation-selection balance [6,7]. Viral genetic diversityis a key factor in disease progression [8,9], vaccine design [10,11], and antiretroviral drugtherapy [12,13]. Ultra-deep sequencing of mixed virus samples is a promising approachto quantifying this diversity and to resolving the viral population structure [14,15,16].Pyrosequencing of a virus population produces many reads, each of which originatesfrom exactly one—but unknown—haplotype in the population. Thus, the central prob-lem is to reconstruct from the read data the set of possible haplotypes that is consistentwith the observed reads and to infer the structure of the population, i.e., the relativefrequency of each haplotype.Here we present a computational four-step procedure for making inference aboutthe virus population based on a set of pyrosequencing reads (Fig. 1). First, the reads2re aligned to a reference genome. Second, sequencing errors are corrected locally inwindows along the multiple alignment using clustering techniques. Next, we assemblehaplotypes that are consistent with the observed reads. We formulate this problem asa search for a set of covering paths in a directed acyclic graph and show how the searchproblem can be solved very efficiently. Finally, we introduce a statistical model thatmimics the sequencing process and we employ the maximum likelihood (ML) principlefor estimating the frequency of each haplotype in the population.The alignment step of the proposed procedure is straightforward for the data an-alyzed here and has been discussed elsewhere [5]. Due to the presence of a referencegenome, only pairwise alignment is necessary between each read and the referencegenome. We will therefore focus on the core methods of error correction, haplotypereconstruction, and haplotype frequency estimation. Two independent approaches arepursued for validating the proposed method. First, we present extensive simulationresults of all the steps in the method. Second, we validate the procedure by recon-structing four independent HIV populations from pyrosequencing reads and comparingthese populations to the results of clonal Sanger sequencing from the same samples.These datasets consist of approximately 5000 to 8000 reads of average length 105bp sequenced from a 1kb region of the pol gene from clinical samples of HIV-1 popula-tions. Pyrosequencing (with the Roche GS20 platform [17]) can produce up to 200,000usable reads in a single run. Part of our contribution is an analysis of the interactionbetween the number of reads, the sequencing error rate and the theoretical resolutionof haplotype reconstruction. The methods developed in this paper scale to these hugedatasets under reasonable assumptions. However, we concentrate mainly on a samplesize (about 10,000 reads) that produces finer resolution than what is typically obtainedusing limiting dilution clonal sequencing. Since many samples can be run simultane-ously and independently, this raises the possibility of obtaining data from about 20populations with one pyrosequencing run.Estimating the viral population structure from a set of reads is, in general, anextremely hard computational problem because of the huge number of possible hap-lotypes. The decoupling of error correction, haplotype reconstruction, and haplotypefrequency estimation breaks this problem into three smaller and more manageabletasks, each of which is also of interest in its own right. The presented methods arenot restricted to RNA virus populations, but apply whenever a reference genome isavailable for aligning the reads, the read coverage is sufficient, and the genetic distancebetween haplotypes is large enough. Clonal data indicates that the typical variation inthe HIV pol gene is about 3-5% in a single patient [18]. We find that as populationsgrow more diverse, they become easier to reconstruct. Even at 3% diversity, we findthat much of the population is reconstructible using our methods.The pol gene has been sequenced extensively and (essentially) only one specificinsertion seems to occur: the “69 insertion complex,” which occurs under NRTI pressure[19]. None of our samples were treated with NRTIs, and the Sanger clones did notdisplay this (or any) indel. Therefore we assume throughout that there are no true3ndels in the population. However, the algorithms developed in this paper generalize ina straightforward manner for the case of true indels.The problem of estimating the population structure from sequence reads is similarto assembly of a highly repetitive genome [20]. However, rather than reconstructingone genome, we seek to reconstruct a population of very similar genomes. As such,the problem is also related to environmental sequencing projects, which try to assessthe genomes of all species in a community [21]. While the associated computationalbiology problems are related to those that appear in other metagenomics projects [22],novel approaches are required to deal with the short and error-prone pyrosequencingreads and the complex structure of viral populations. The problem is also similar to thehaplotype reconstruction problem [23], with the main difference being that the numberof haplotypes is unknown in advance, and to estimating the diversity of alternativesplicing [24].More generally, the problem of estimating diversity in a population from genomesequence samples, has been studied extensively for microbial populations. For example,the spectrum of contig lengths has been used to estimate diversity from shotgun se-quencing data [25]. Using pyrosequencing reads, microbial diversity has been assessedby counting BLAST hits in sequence databases [26]. Our methods differ from previouswork in that we show how to analyze highly directed, ultra-deep sequencing data usinga rigorous mathematical and statistical framework.
We have developed a computational and statistical procedure for inferring the structureof a diverse virus population from pyrosequencing reads. Our approach comprises fourconsecutive steps (Fig. 1), starting with the alignment of reads to a reference sequenceand followed by error correction, haplotype reconstruction, and haplotype frequencyestimation.
Given the high error rate of pyrosequencing, error correction is a necessary startingpoint for inferring the virus population. The errors in pyrosequencing reads typicallytake the form of one-base indels along with substitutions and ambiguous bases andoccur most often in homopolymeric regions. The reads come with quality scores foreach base quantifying the probability that the base is correct.Error rates with the Roche GS20 system have been estimated as approximately 5–10 errors per kb [4,5]. However, a small number of reads accounts for most of the errors.Thus after discarding approximately 10% of the reads (those with ambiguous bases oratypical length), the error can be reduced to 1–3 errors per kb [4]. These remainingerrors are about half insertions and a quarter each deletions and substitutions.Due to our assumption that there are no haplotypes with insertions in the popula-tion, the insertions in the reads can all be simply corrected through alignment with the4 X Fig. 2.
Error correction. Fixed-width windows (shown as the dashed box) over thealigned reads are considered. Two different types of reads are depicted (light versusdark lines) indicating their origin from two different haplotypes. Genetic differences(indicated by circles and squares) provide the basis for clustering reads into groupsrepresenting the respective haplotypes. After clustering, errors (marked as crosses) canbe corrected.reference genome. We do not deal with the problem of alignment here; it is straight-forward because our assumption of the existence of a reference genome implies thatonly pairwise alignment is necessary. In the remainder of the paper, we assume that acorrect alignment is given, leaving about 1 error per kb to correct (or 3 errors per kbif the low-quality reads are not aggressively pruned).Our approach for error correction resembles the method of [27] for distinguishingrepeats in whole genome shotgun assemblies combined with [28]. We consider all thereads in a window over the multiple alignment and cluster these reads using a statisticaltesting procedure to detect if a group of reads should be further split (Fig. 2). The readsin each cluster are then corrected to the consensus sequence of the cluster.The statistical testing procedure consists of two steps. First, every column in thewindow is tested for over-representation of a mutation using a binomial test. Second,every pair of columns is tested to see if a pair of mutations happen together more oftenthan would be expected by chance. See Section 4.1 for details on the tests.Any significant over-representation of a mutation or a pair in a window is regardedas evidence for the reads originating from more than one haplotype. The testing pro-cedure produces an estimate for the number of haplotypes in the window as follows.First, all single mutations are tested for significance; each significant mutation givesevidence for another haplotype in the window. Next, all pairs of mutations are tested;any significant pairs is evidence for another haplotype. However, this process can over-count the number of haplotypes in the window in certain cases if two mutations aresignificant both by themselves and as a pair. In this case, we correct for the over-count,see Section 4.1.We then separate the reads into k groups using k -means clustering. The algorithm isinitialized with both random clusters and clusters found by a divisive clustering methodbased on the statistical tests. We use the Hamming distance between sequences to5alculate cluster membership; the consensus sequences define the cluster centers. Theconsensus sequence is computed from weighted counts based on the quality scores.Thus, the inferred cluster centers are the reconstructed haplotypes. Combining testingand clustering we proceed as follows: Algorithm 1. (Local error correction)
Input:
A window of aligned reads.
Output:
The k haplotypes in the window and the error corrected reads. Procedure:
1. Find all candidate mutations and pairs of mutations and test for overrepresentation.2. Count the number of non-redundant mutations and pairs that are significant. Thisis the number k of haplotypes in the window.3. Cluster the reads in the window into k clusters and correct each read to its clustercenter.4. Output corrected reads. Applying a parsimony principle the algorithm finds the smallest number k of hap-lotypes that explain the observed reads in each window. The genomic region to beanalyzed is divided into consecutive windows and Algorithm 1 is run in each of them.We use three collections of successive windows that are shifted relative to each othersuch that each base in the region is covered exactly three times. The final correction ofeach base is the consensus of the three runs.The error correction procedure can lead to uncorrected errors or miscorrectionsvia false positives and negatives (leading to over/underestimation of the number ofhaplotypes in a window) or misclustering. See Section 4.1 for implementation detailsand a discussion of setting the parameters so as to minimize these mistakes.False positives arise when an error is seen as a significant variant; they will be con-sequences of setting the error rate too low or the significance level too high, or if errorsare highly correlated. Misclustering can happen if errors occur frequently enough on asingle read to make that read appear closer to an incorrect haplotype. This likelihood isincreased as the window size grows and more reads overlap the window only partially.An analysis of the false negative rate gives an idea of the theoretical resolution ofpyrosequencing. False negatives arise when a true variant tests as non-significant andthus is erased. If the input data was error-free, this would be the only source of mistakencorrections and would happen by eliminating rare variants. Given an error rate of 2.5errors per kb, the calculation in Section 4.1 shows that variants present in under 1% ofthe population would be erased on a dataset of 10,000 reads. In Section 2.2 below, wewill show that this number of reads is about enough to expect to resolve haplotypespresent in 1% of the population. Our approach to haplotype reconstruction rests on two basic beliefs. First, the haplo-types in the populations should not exhibit characteristics that are not present in theset of reads. This means that every haplotype in the population should be realizable6s an overlapping series of reads. Second, the population should explain as many readsas possible with as few haplotypes as possible.We assume a set R of aligned and error-corrected reads obtained from sequencinga population. If all haplotypes have the same length n , then each aligned read consistsof a start position in the genome and a string representing the genomic sequence. Wesay that two reads overlap if there are positions in the genome to which they are bothaligned. They agree on their overlap if they agree at all of these positions. We calla haplotype completely consistent with the set of reads R if the haplotype can beconstructed from a subset of overlapping reads of R that agree on their overlaps. Let C R be the set of all haplotypes that are completely consistent with R . In the following,we provide methods for constructing and sampling from C R and we present an efficientalgorithm for computing a lower bound on the number of haplotypes necessary toexplain the reads. Both techniques rely on the concept of a read graph. Definition 2. (Read graph)
The read graph G R associated with a set of reads R isthe acyclic directed graph with vertices {R irred , s, t } consisting of a source s , a sink t ,and one vertex for every irredundant read r ∈ R . Here, a read is redundant if there isanother read that overlaps it completely such that the two reads agree on their overlap.The edge set of G R is defined by including an edge from an irredundant read r to anirredundant read r , if1. r starts before r in the genome,2. r and r agree on their (non-empty) overlap, and3. there would not be a path in G R from r to r without this edge.Finally, edges are added from the source s to all reads beginning at position 1, and fromall reads ending at position n to the sink t . A path in the read graph from the source to the sink corresponds to a haplotypethat is completely consistent with R . Thus, finding C R , the set of completely consistenthaplotypes, amounts to efficiently enumerating paths in the read graph.For example, in Fig. 3 a simplified genome of length n = 8 over the binary alphabet { , } is considered, and an alignment of 20 reads (each of length 3) is shown in Fig. 3a.These data give rise to the read graph depicted in Fig. 3b. For instance, the haplotype is completely consistent with the reads and corresponds to the top path inthe graph. For more complex read graphs, see Fig. 6.We say that a set of haplotypes H is an explaining set for R if every read r ∈ R canbe obtained as a substring of some haplotype in H . We seek a small set of explaininghaplotypes and focus on the set C R , which consists exactly of those haplotypes thatemerge from the data. The following proposition provides a criterion for C R to be anexplaining set in terms of the read graph. Proposition 3.
The set of haplotypes completely consistent with a set of reads is anexplaining set for these reads if, and only if, every vertex of the read graph lies on adirected path from the source to the sink.
The Lander–Waterman model of sequencing is based on the assumptions that readsare random (uniformly distributed on a genome) and independent [29]. In this model,7
01 011 110 100 000 000000 000 001 011101011111 111 110111 111110101100s t (a) (b)
Fig. 3.
Read graph. A simplified genome of length n = 8 over the binary alphabet { , } is considered. Twenty reads of length 3 each are aligned to an assumed referencesequence (a). The induced read graph has 20 + 2 vertices and 28 edges (b).the probability that all bases of a genome of length n are sequenced follows the Poissondistribution p = (1 − e − c ) n , where c is the coverage (the total number of bases sequencedper position). For a sequencing experiment from a mixed population with differentabundances of haplotypes (or subspecies), a similar approach can be applied [22]. Forthe probability of complete coverage of all haplotypes occurring with a frequency of atleast ρ , we have p ≥ (1 − e − c · ρ ) n . Since c = N L/n , where N is the number of reads and L is the read length, sequencing N ≥ − n ln(1 − p /n ) ρ L (1)reads will ensure that the completely consistent haplotypes assembled from the readsare an explaining set for these haplotypes. For example, in order to cover all haplo-types of 5% or higher frequency of length 1000 bases with reads of length 100 with99% probability, at least 2302 reads need to be sequenced; to reach haplotypes at 1%frequency with 99% probability, 11,508 reads are needed. Notice that the number ofreads needed scales linearly with the the inverse of the smallest frequency desired. Wenote that the actual number of required reads can be much smaller in genomic regionsof low diversity.If Proposition 3 is violated, we can remove the violating set of reads to obtain anew set satisfying Proposition 3. This amounts to discarding reads that either containmistakes in the error correction or come from haplotypes that are at a too low frequency8n the population to be fully sequenced. Thus the resolution is inherently a function ofthe number of reads.We are now left with finding a minimal explaining set of completely consistenthaplotypes. Restricting to this subset of haplotypes reduces the computational demandof the problem significantly. Proposition 3 implies that an explaining set of completelyconsistent haplotypes is precisely a set of paths in the read graph from the source tothe sink, such that all vertices of the read graph are covered by at least one path. Wecall such a set of paths a cover of the read graph. The following result shows that aminimal cover can be computed efficiently (see Methods for a proof). Theorem 4. (Minimal cover of the read graph) (1)
Every minimal cover of the read graph has the same cardinality, namely the size ofthe largest set Q of vertices such that there are no paths between elements of Q . (2) A minimal cover of the read graph can be computed by solving a maximum matchingproblem in an associated bipartite graph. This matching problem can be solved intime at worst cubic in the number of irredundant reads.
The minimal path cover obtained from the maximum matching algorithm is ingeneral not unique. First, it provides a minimal chain decomposition of the graph.A chain in a directed acyclic graph is a set of vertices that all lie on at least onecommon path from the source to the sink. A chain can generally be extended to anumber of different paths. Second, the minimal chain decomposition itself is in generalnot unique. However, the cardinality of the minimal cover is well-defined. It is animportant invariant of the set of reads, indicating the smallest number of haplotypesthat can explain the data. Notice that the size of the minimal read graph cover canbe greater than the maximum number of haplotypes in a given window of the errorcorrection step. The cardinality of the minimal cover is a global invariant of the set ofreads.
Algorithm 5. (Construction of a minimal set of explaining haplotypes)
Input:
A set R of aligned, error corrected reads satisfying the conditions of Prop. 3. Output:
A minimal set of explaining haplotypes for R . Procedure:
1. Construct the read graph G R associated with R .2. Compute a minimal chain decomposition of the read graph.3. Extend the chains in the graph to paths from the source to the sink in G R .4. Output the set of haplotypes corresponding to the paths found in step 3. The algorithm can easily be modified to produce a non-minimal set by constructingmultiple chain decompositions and by choosing multiple ways to extend a chain to apath. We note that the set of all paths in the graph is generally much too large to beuseful. For example, the HIV datasets give rise to up to 10 paths and in simulationswe often found over 10 different paths in the graph. Generating paths from minimalexplaining sets is a reasonable way of sampling paths, as we will see in Section 2.4 (alsosee Fig. 6 and S1). 9 h Fig. 4.
Schematic representation of the sampling process. The virus population is rep-resented by five genomes (top) of two different haplotypes (light versus dark lines). Theprobability distribution is p = (3 / , / p , and reads are sampleduniformly from the haplotypes (bottom).Finally, if the condition of Proposition 3 is not satisfied, i.e., if the coverage is toolow and the set of completely consistent haplotypes does not contain an explaining set,then that condition can be relaxed. This corresponds to modifying the read graph byadding edges between all non-overlapping reads. Algorithm 5 will then again find aminimal set of explaining haplotypes. A virus population is a probability distribution on a set of haplotypes. We want toestimate this distribution from a set of observed reads. Let H be a set of candidatehaplotypes. In principle, we would like H to be the set of all possible haplotypes, butin practice we must restrict H to a smaller set of explaining haplotypes as derived fromAlgorithm 5 in order to make the estimation process feasible. Let R be the set of allpossible reads that are consistent with the candidate haplotypes in H . The read datais given as a vector u ∈ N R , where u r is the number of times that read r has beenobserved.Our inference is based on a statistical model for the generation of sequence readsfrom a virus population. Similar models have been used for haplotype frequency estima-tion [30,31,32]. We assume that reads are sampled as follows (Fig. 4). First, a haplotype h is drawn at random from the unknown probability distribution p = ( p h ) h ∈H . Second,a read r is drawn with uniform probability from the set of all reads with which thehaplotype is consistent. Estimating the structure of the population is the problem ofestimating p from u under this generative model.Let H be the hidden random variable with values in H that describes the haplotypeand R the observed random variable over R for the read. Then the probability ofobserving read r under this model isPr( R = r ) = (cid:88) h ∈H p h Pr( R = r | H = h ) , where the conditional probability is defined as Pr( R = r | H = h ) = 1 /K if h isconsistent with r , and 0 otherwise. Here K is the number of reads r ∈ R that h
10s consistent with. Since we assume that all haplotypes have the same length, K isindependent of both r and h .We estimate p by maximizing the log-likelihood function (cid:96) ( p , . . . , p |H| ) = (cid:88) r ∈R u r log Pr( R = r ) . This is achieved by employing an EM algorithm (see Section 4.3 for details). Eachiteration of the EM algorithm runs in time O ( |R||H| ). For example, for 5000 reads and200 candidate haplotypes, the EM algorithm typically converges within minutes on astandard PC.Software implementing the algorithms for error correction, haplotype reconstruc-tion, and frequency estimation is available upon request from the authors. We have simulated HIV populations of different diversities and then generated readsfrom these populations by simulating the pyrosequencing procedure with various errorrates and coverage depths. The first 1kb of the HIV pol gene was the starting point forall simulations. We separately analyze the performance first of error correction, thenof haplotype reconstruction, then of haplotype frequency estimation, and finally of thecombination of these three steps.The simulations show that Algorithm 1 reduces the error rate by a factor of 30.This performance is largely independent of the number of haplotypes in the population(Fig. S3). ReadSim [33] was used to simulate the error process of pyrosequencing. Theerror rate after alignment is about 1–3 errors per kb, so we are left with about 0.1 errorsper kb after error correction. As the population grows and becomes more diverse, thealignment becomes more difficult resulting in a smaller error reduction (Fig. S3).In order to assess the ability of Algorithm 5 to reconstruct 10 haplotypes from10,000 error-free reads (yielding about 1500 irredundant reads), we generate increasingnumbers of candidate haplotypes. This is achieved by repeatedly finding a minimal setof explaining haplotypes (see Section 2.2) until either we reach the desired number ofhaplotypes or we are unable to find more haplotypes that are part of a minimal explain-ing set. Fig. 5 visualizes the enrichment of recovered true haplotypes with increasingnumber of candidate haplotypes for different levels of population diversity. While in low-diversity populations exact haplotype reconstruction can be very challenging (Fig. 5a),the algorithm will always find haplotypes that are close to the true ones. For example,at 5% diversity 10 out of 50 candidate haplotypes will match the original 10 haplo-types at an average Hamming distance of just 1.6 amino acid differences (Fig. 5b).With larger populations, the performance is similar although more candidate haplo-types need to be generated (Fig. S2). Given that the total number of paths in the readgraphs considered here is over 3 · , the strategy of repeatedly finding minimal setsof explaining haplotypes is very efficient for haplotype reconstruction.Fig. 6 shows how the haplotype reconstruction problem gets harder at lower di-versity. In each graph, the bottom five lines correspond to reads matching one of the11 l l l l l l l l Number of constructed haplotypes P e r c en t c o rr e c t l l l l l l l l l Number of constructed haplotypes A v e r age H a mm i ng d i s t an c e f r o m c o rr e c t l Diversity3%5%7% (a) (b)
Fig. 5.
Haplotype reconstruction. Up to 300 candidate haplotypes were generated usingAlgorithm 5 from 10,000 error free reads drawn from populations of size 10 at varyingdiversity levels. Displayed are two measures of the efficiency of haplotype reconstruc-tion: the percent of the original haplotypes with exact matches among the reconstructedhaplotypes (a), and the average Hamming distance (in amino acids) between an originalhaplotype and its closest match among the reconstructed haplotypes (b).original five haplotypes uniquely. The sixth line on top (if present) corresponds to readswhich could come from several haplotypes. At 3% diversity (Subfigure (a)), only one ofthe haplotypes is reconstructed well. At 5% diversity (Subfigure (b)), the decompositionis almost correct except for a few small “crossovers”. At 7% diversity (Subfigure (c)),the chain decomposition exactly reconstructs the five haplotypes. By using multipledecompositions we can reconstruct many of the haplotypes correctly (Fig. S1) even inlow diversities.The performance of the EM algorithm for haplotype frequency estimation describedin Section 2.3 is measured as the Kullback–Leibler (KL) divergence between the originalpopulation p and its estimate ˆ p . We consider populations with 10 different haplotypes,each with frequency 1 /
10, at 5% diversity. Haplotype frequencies are estimated frombetween 500 and 6000 error-free reads (Fig. 7). The performance of the EM algorithm iscompared to that of a simple heuristic method, which assigns frequencies to the haplo-types in proportion to the number of reads they explain (see Methods, Section 4.4). Forboth methods, the KL divergence D KL ( p (cid:107) ˆ p ) decreases roughly exponentially with thenumber of reads. However, the EM algorithm significantly outperforms the heuristicfor all sizes of the read set and this improvement in prediction accuracy increases withthe number of reads.In order to test the combined performance of the haplotype reconstruction andfrequency estimation, our basic measure of performance is the proportion of the original12 a)(b)(c) Fig. 6.
Read graphs for 1000 reads from populations of 5 haplotypes at 3% (a), 5%(b), and 7% diversity (c). The bottom five lines in the graph correspond to reads whichmatch the five haplotypes uniquely; the top line in subfigures (a) and (b) contains readswhich match several haplotypes. In each subfigure, the reads are colored according toa single chain decomposition. 13 l l l l l l l l l l . . . . . Reads K L d i v e r gen c e l AlgorithmEMsimple
Fig. 7.
Haplotype frequency estimation. Haplotype frequencies were inferred using boththe EM algorithm in Section 2.3 (circles) and a simple heuristic algorithm (triangles);the resulting distance from the correct frequencies is measured using the KL divergence.Error bars give the interquartile range over 50 trials. The populations consisted of 10haplotypes at equal frequency and 5% diversity. The input to the algorithms was a setof reads simulated from the population and the original 10 haplotypes.population that is reconstructed within 10 amino acid differences. This measure, whichwe call ϕ , is defined as follows (see also Section 4.4). For each inferred haplotype,we determine the closest original haplotype and sum up the frequencies of all inferredhaplotypes that differ from their assigned original haplotypes by at most ten sites. Thisperformance measure indicates how much of the population has been reconstructedreasonably well. It is less sensitive to how well haplotypes and haplotype distributionsmatch (see Fig. 5 and 7 for those performance measures).For the first simulation of combined performance, we consider error-free reads frompopulations consisting of between 5 and 100 haplotypes, each with equal frequency, atdiversities between 3 and 8%. We simulated 10,000 error-free reads of average length100 from these populations and ran haplotype reconstruction and frequency estimation.Fig. 8 shows that performance increases as diversity increases and drops slightly asthe number of haplotypes increases. As we saw above, we would expect to be ableto reconstruct populations with size approximately 100 using 10,000 reads under theLander–Waterman model.For the second combined test, we tested all three steps: error correction, haplotypereconstruction, and frequency estimation. In order to model the miscorrection of errors,14 l l l l l . . . . . . Diversity P r opo r t i on c l o s e l Number of haplotypes5255075100
Fig. 8.
Combined population reconstruction procedure. The proportion of the popula-tion reconstructed within 10 amino acid differences ( ϕ , “Proportion close”) is shown.Here 10,000 error-free reads were sampled from populations with diversity between 3and 8% and with between 5 and 100 haplotypes of equal frequency. Haplotypes werereconstructed and then frequencies were estimated.we ran ReadSim [33] to simulate the actual error process of pyrosequencing and then ranerror correction. New, error-free reads were simulated and errors were added throughsampling from the distribution of the uncorrected errors in order to reach error rates ofexactly 0 . . .
2, a small performance loss asthe number of reads increases. This phenomenon appears to be related to the fact thatmore reads give rise to more paths in the graph, thereby increasing the chances that15 l l l l l l l l . . . . . .
0 errors per kb
Reads P r opo r t i on c l o s e ll l l l l l l l . . . . . . Reads P r opo r t i on c l o s e ll l l l l l l l . . . . . . Reads P r opo r t i on c l o s e l Diversity2.5%5%7%
Fig. 9.
Population reconstruction with errors. Proportion of population reconstructedwithin 10 amino acid differences ( ϕ , “Proportion close”) using haplotype reconstruc-tion and frequency estimation. The original populations had 10 haplotypes of equalfrequency at varying levels of diversity. Error was randomly introduced in the simu-lated reads to mimic various levels of error correction.completely consistent haplotypes that contain errors are assigned positive probabilities.In fact, the size of a minimal path cover increases approximately linearly with thenumber of reads and this increase does not appear to depend much on populationdiversity (Fig. S4). Our second evaluation of population reconstruction is based on ultra-deep sequenc-ing of drug-resistant HIV populations from four infected patients [5]. The four viruspopulations were analyzed independently using pyrosequencing and clonal Sanger se-quencing. Table 1 shows the resulting statistics on the datasets. The pyrosequencingbased approach mirrors very closely the clonal sequencing. To compare the populationsinferred from pyrosequencing to the clonal sequences, we use the measure ϕ whichindicates the percent of the inferred population that matches a clonal sequence withinone amino acid difference. This is used instead of ϕ used in Section 2.4 to providea more sensitive performance measure. In all samples, at least 51.8% of the inferredpopulations were within one amino acid difference of a clonal haplotype. Based on thepresent data, we cannot decide whether the additional inferred haplotypes went un-detected by the Sanger sequencing, or if they are false positives of the reconstructionmethod.We found many additional haplotypes in our analysis of the most complex sample,V11909. Table 2 shows a comparison between the inferred population for V11909 andthe clonal haplotypes. The populations were analyzed at 15 positions in the proteaseassociated with drug resistance, taken from the HIV Drug Resistance Database [34].16 yrosequencing Haplotype reconstruction Comparison to clonal seq.Sample Reads Irred ρ Gaps/kb Err/kb Min. cover Diversity Clones Avg. dist. ϕ V11909 5177 641 2.2 3.3 1.10 22 15.8 65 1.81 51.8V54660 7777 228 1.5 2.3 1.67 4 1.0 32 0.34 99.6V3852 4854 227 2.4 3.4 1.33 7 1.4 42 0.29 100V2173 6304 354 1.8 2.3 1.31 4 2.3 26 0.81 86.6
Table 1.
Population reconstruction from four HIV samples. The first four columnsdescribe the pyrosequencing data: the number of reads, the number of irredundantreads (see Theorem 4), the expected frequency (in percent) of the least frequent hap-lotype we can expect to cover with 99% confidence, and the number of gaps/kb in thealigned data. The next three columns describe the reconstruction algorithm: the num-ber of non gap characters changed in error correction, the size of a minimal explainingset of haplotypes, and the diversity, measured as the expected number of amino aciddifferences among the estimated population. After error correction, reads were trans-lated into amino acids. The final 3 columns describe the validation using (translated)clonal sequences: the number of clones sequenced, the average distance between theestimated population and the closest Sanger haplotype, and the percent ( ϕ ) of theestimated population that was close (up to 1 amino acid difference) to a clone.All but 4 of the 65 clonal haplotypes (6.1%) are matched in the inferred population,and the frequencies in the inferred population are a reasonable match to the frequenciesof the mutation patterns in the clonal haplotypes. Using the Lander–Waterman model,we find that the pyrosequencing reads obtained from the HIV samples are enough toreconstruct with 99% probability all haplotypes that occur at a frequency of at least2.2% (Eq. 1, Tab. 1). By comparison, the Sanger sequencing approach yielded 65 clonalsequences, 37 of which were mixtures of two or more clones. Pyrosequencing constitutes a promising approach to estimating the genetic diversity ofa community. However, sequencing errors and short read lengths impose considerablechallenges on inferring the population structure from a set of pyrosequencing reads.We have approached this task by identifying and solving consecutively three computa-tional problems: error correction, assembly of candidate haplotypes, and estimation ofhaplotype frequencies. Our methods focus on the situation where a reference genome isavailable for the alignment of reads. This is the case, for example, for many importantpathogens, such as bacterial and viral populations.The procedure consists of three steps. First, error correction is performed locally.We take windows of fixed width over the aligned reads and cluster reads within thewindows in order to resolve the local haplotype structure. This approach is a combina-tion of methods [27,28] specially tailored to pyrosequencing reads. Next, haplotypes arereconstructed using a new application of a classic combinatorial algorithm. This step is17 requencySanger Pyro Mutations52.3 19.3 M46I, I54V, G73I, I84V, L90M12.3 19.0 M46I, I54V, G73S, I84V, L90M9.2 9.4 M46I6.2 5.64.6 7.1 M46I, I54V, G73S, L90M4.6 1.9 M46I, I54V, G73I, L90M3.1 5.8 M46I, G73I, I84V, L90M3.1 0.0 L33F, M46I, I54V, G73S, I84V, L90M1.5 1.9 M46I, L90M1.5 0.0 L33F, M46I, I54V, G73I, I84V, L90M1.5 0.0 M46I, I54V, G73N, I84V, L90M0.0 4.9 M46I, G73I0.0 4.7 M46I, I84V0.0 4.0 M46I, G73S, I84V, L90M0.0 3.1 M46I, I54V, G73S, V82I, I84V, L90M0.0 2.9 M46I, I54V, G73I, I84V0.0 2.9 M46I, I50V, I54V, G73I, I84V, L90M0.0 2.0 I84V0.0 1.4 I54V, G73I, I84V, L90M0.0 1.2 M46I, I50V, I54V, G73S, L90M0.0 1.1 M46I, I54V0.0 1.0 M46I, I50V, I84V, L90M0.0 0.5 M46I, I54V, G73I0.0 0.3 G73S, I84V, L90M
Table 2.
Comparison of the patterns of resistance mutation for the 65 Sanger sequencesand the estimated population for sample V11909. Mutation patterns were restricted to15 positions in the protease (amino acids 23, 24, 30, 32, 33, 46, 48, 50, 53, 54, 73, 82,84, 88, and 90) associated with PI resistance.18he main theoretical advance in this paper. Finally, haplotype frequencies are inferredas the ML estimates of a statistical model that mimics the pyrosequencing process. Wehave developed an EM algorithm for solving this ML problem.Haplotype reconstruction is based on two assumptions: consistency and parsimony.We require that each haplotype be constructible from a sequence of overlapping readsand that the set of explaining haplotypes be as small as possible. The Lander–Watermanmodel of sequencing implies lower bounds on the number of reads necessary to meetthe first requirement. The fundamental object for haplotype reconstruction is the readgraph. A minimal set of explaining haplotypes corresponds to a minimal path coverin the read graph, and this path cover can be found efficiently using combinatorialoptimization. Moreover, the cardinality of the minimal path cover is an importantinvariant of the haplotype reconstruction problem related to the genetic diversity ofthe population.We believe that these methods are also applicable to many metagenomic projects.In such projects, estimation of the diversity of a population is a fundamental question.The size of a minimal cover of a fragment assembly graph provides an intuitive andcomputable measure of this diversity.We have validated our methods by extensive simulations of the pyrosequencingprocess, as well as by comparing haplotypes inferred from pyrosequencing data to se-quences obtained from direct clonal sequencing of the same samples. Our results showthat pyrosequencing is an effective technology for quantitatively assessing the diversityof a population of RNA viruses, such as HIV.Resistance to most antiretroviral drugs is caused by specific mutational patternscomprising several mutations, rather than one single mutation. Thus, an importantquestion that can be addressed efficiently by pyrosequencing is which of the resistancemutations actually occur on the same haplotype in the population. Since our methodsavoid costly clonal sequencing of the HIV populations for determining the co-occurrenceof HIV resistance mutations [35], pyrosequencing may become an attractive alternativeto the traditional clonal Sanger sequencing.The sample size of approximately 10,000 reads we have considered provides us withthe opportunity of detecting variants present in only 1% of the population. Pyrose-quencing can produce 200,000 reads and thus twenty populations could be sequencedto a good resolution using a process less labor intensive than a limiting dilution clonalsequencing to a similar resolution of a single population.The simulations suggest that the method works best with populations that aresuitably diverse. Intuitively, the information linking two reads together on the samehaplotype decays rapidly in sections of the genome where there are few identifyingfeatures of that haplotype (as in a region of low diversity). In particular, repeats ofsufficient length in the reference genome can completely destroy linkage information.However, at some point the benefits of increased diversity will be partially reduced bythe increased difficulty of the alignment problem. With more diverse populations ortrue indels, alignment to a single reference genome will become less accurate.19he HIV pol gene analyzed here is on the low end of the diversity spectrum. The env gene with its higher variability may be a better target for some applications. We expectthe proposed methods to improve early detection of emerging drug resistant variants[36,37], and to support the genetic and epidemiological study of acute infections, inparticular the detection of dual infections [38].Since our computational procedure produces an estimate of the entire virus popula-tion, it allows the study of fundamental questions about the evolution of viral popula-tions in general [39]. For example, mathematical models of virus evolution can be testeddirectly within the accuracy of estimated viral haplotype frequencies [40]. Predictingviral evolution is considered an important step in HIV vaccine development [10].In addition to the promising biological applications, there are many interestingtheoretical questions about reconstructing populations from pyrosequencing data. Theerrors in pyrosequencing reads tend to be highly correlated, as they occur predominatelyin homopolymeric regions. While this can make correction more difficult (a fact whichcan be counteracted by the use of quality scores), we believe that it can make haplotypereconstruction more accurate than if the errors were uniform. If errors are isolated to afew sites in the genome, fewer additional explaining haplotypes are needed than if theerrors were distributed throughout. The exact relationship between the error processof pyrosequencing, error correction, and haplotype reconstruction is worthy of furtherstudy.As pyrosequencing datasets can contain 200,000 reads, it is worthwhile to investigatehow our methods scale to such large datasets. Haplotype reconstruction is the onlystep which is not immediately practical on such a large number of reads, since it isat worst cubic in the number of irredundant reads. However, problems of this size areapproachable with our methods as follows.The theoretical resolution of the algorithms depends on two factors: first, the abilityto differentiate between errors and rare variants; and second, whether there are enoughreads so that we can assemble all haplotypes. We have seen that the number of readsnecessary for assembly scales with the inverse of the desired resolution (Eq. 1): if N reads cover all haplotypes of frequency at least ρ , then kN reads are needed to coverall haplotypes of frequency at least ρ/k . However, the resolution of error correction isat most the overall error rate as the number of reads grows; see Table 3.The limited resolution of error correction combined with the elimination of redun-dant reads makes haplotype reconstruction feasible for large datasets. For example,error correction on 200,000 reads with (cid:15) = 0 . α = 0 .
001 will erase all variantswith frequency below 0 . ≈ / . umber of reads 1000 5000 10,000 50,000 100,000 200,000Error resolution(%) 3 1.2 0.9 0.5 0.42 0.365Reconstruction resolution (%) 11.5 2.3 1.2 0.23 0.12 0.058 Table 3.
Resolution of the error correction (binomial test only) and haplotype re-construction as a function of number of reads. The resolution of the error correction isdefined as the smallest frequency of a mutation that will be visible over the backgrounderror rates; it is calculated with error rate (cid:15) = 0 . α = 0 . ,
000 reads the error correction is thelimiting factor.Current and future improvements to pyrosequencing technology will lead to longerreads (250 bp), more reads, and lower errors. However, in order for huge numbersof reads to be of great help in the ultra-deep sequencing of a population, the errorrates must also decrease. The performance of our methods as read length varies isan important question, given the availability of sequencing technologies with differentread lengths (e.g., Solexa sequencing with 30–50 bp reads) and the desire to assemblehaplotypes of greater size (e.g., the 10 kb entire HIV genome).Notice that haplotype reconstruction seems to be quite good locally (cf. Fig. 6 andS1) in that many reconstructed haplotypes contain large contiguous regions where theyagree with a real haplotype. However, the measures of performance considered in thispaper all deal with the entire haplotype and ignore partial results of this type. Thiswould seem to imply that longer reads will improve the reconstruction performance ona fixed length genome; new performance measures will have to be developed to analyzethese problems.
We use two statistical tests for locally detecting distinct haplotypes. The first testanalyzes each column of the multiple alignment window. Write d for the number of readsthat overlap this window. We ask if the observed number of mutations (deviations fromthe consensus base) exceeds our expectation under the null hypothesis of one haplotypeand a uniform sequencing error (cid:15) . The probability of observing x or more mutations isgiven by the binomial distributionPr( X ≥ x ) = d (cid:88) k = x (cid:18) dk (cid:19) (cid:15) k (1 − (cid:15) ) d − k . (2)21here are two parameters to set here: the error rate (cid:15) and the p -value α that is requiredfor significance.Next, we test pairs of mutations in two different alignment columns u and v usingFisher’s exact test. The test statistic is the number C of co-occurrences, which underthe null hypothesis of one haplotype follows the hypergeometric distributionPr( C = c ) = (cid:0) n v c (cid:1)(cid:0) d − n v n u − c (cid:1)(cid:0) dn u (cid:1) , (3)where n u and n v are the number of times the specific mutations have been observedin columns u and v , respectively [28]. Considering pairs provides more power if co-occurrences are observed on reads, but cannot detect single mutation differences. Weset the p -value for this test to be the same as for the binomial test.The procedure tests all columns in the window using (2) and then all pairs ofcolumns using (3). This can lead to over-counting of the number of haplotypes asfollows. Suppose that in columns 1 and 2 the consensus base is A , but that there isa mutation C in some of the reads in each column. If both mutations are significantby themselves, this is evidence of three haplotypes in the window. If they are alsosignificant together, this would be evidence of four haplotypes. However, there couldbe only two true haplotypes at these two positions: AA and CC . To correct for this, wesubtract two from the count whenever two significant mutations are significant togetherand always occur on exactly the same set of reads.We do not explicitly address the multiple comparisons problem associated with thistesting procedure here and regard the significance levels of the tests as parameters ofAlgorithm 1. We account for the quality scores associated with each base by using(rounded) weighted counts in the test statistics. Gaps are treated as unknown basesand represented by a special character with quality score zero. We found that an errorrate of 0 . p -value of 0 . p -value for the tests and the error rate should be adjusted to preventfalse positives and negatives. The number of mutations required in a column beforethe mutation is considered significant can be calculated from (2). For example, with10,000 reads of length 100 in a genome of length 1000, there will be approximately d = 1000 reads overlapping a small window. Setting (cid:15) = 0 . α = 0 . / ≈
1% of the population. Notice that this is quite similar to theestimate under the Lander–Waterman model (Eq. 1), where 11,508 reads are neededto cover all haplotypes at 1% frequency.Notice that the power of the error correction grows very slowly, see Table 3. On adataset with 200,000 reads, then error correction would eliminate any variants present22n less than 0 . ,
000 reads are needed to achievethis resolution with haplotype reconstruction.
Suppose the read graph G R has V vertices and E edges. Since G R is acyclic, it defines apartial order on the set of irredundant reads, R irred . Part (1) is then a direct applicationof Dilworth’s theorem [41] to this partially ordered set.The associated bipartite graph has vertex set { A, B } , where both A and B are equalto R irred . There is an edge between r ∈ A and s ∈ B if there is a path from r to s in G R . Then a maximal matching in the bipartite graph is equivalent to a minimal chaindecomposition of G R (see [42]).For the time complexity, notice that building the read graph G R is of complexity O ( V ). Building the associated bipartite graph is equivalent to finding the transitiveclosure of the read graph and thus is O ( V E ). The efficient matching algorithm forthe solution of the matching problem is due to Hopcroft and Karp [43]. For a gen-eral bipartite graph with V (cid:48) vertices and E (cid:48) edges, the Hopcroft-Karp algorithm is oftime complexity O ( E (cid:48) √ V (cid:48) ). Since in our construction, V (cid:48) = 2 V and E (cid:48) = O ( V ), thematching algorithm takes time O ( V / ). Depending on the structure of the graph, eitherthe transitive closure or matching problems can dominate, but both are of complexity O ( V ). We use an EM algorithm [44] to estimate the maximum likelihood haplotype frequen-cies. We iteratively estimate the missing data u rh , i.e., the number of times read r originated from haplotype h , and solve the easier optimization problem of maximizingthe log-likelihood of the hidden model (cid:96) hid ( p , . . . , p |H| ) = (cid:88) r ∈R (cid:88) h ∈H u rh log ( p h Pr( R = r | H = h )) . In the E step, the expected values of the missing data are computed as u rh = u r p h Pr( R = r | H = h )Pr( R = r ) . In the M step, maximization of (cid:96) hid yieldsˆ p h = (cid:80) r ∈R u rh (cid:80) r ∈R u r . Our starting point is the first 1kb of the wild type sequence of the HIV pol gene, encod-ing the 99 amino acids of the protease and the beginning of the reverse transcriptase.23andom mutations are introduced into this strain in order to generate genetic diversity.We generated various populations in this way with diversities between 20 and 80 basepairs (2–8%). All haplotypes were set to have the same frequency in the population.We report the expected value of the Hamming distance between two haplotypesdrawn from a population as our basic measure of the diversity of a population. Thisstatistic, which we call simply “diversity” can be thought of as a version of the Simpsonmeasure [45] that takes into account the genetic structure.We use ReadSim [33] (available from ) to simulate the error process of pyrosequenc-ing. We generate reads by running ReadSim with the options -meanlog 0.15 -sigmalog0.08 -filter (aside from these options, we use the default parameters). This processresults in about 7 insertions and 3 deletions per kb. Since pyrosequencing produceslight coverage on the tails of the input genomes, we simulate by padding the region ofconcern with 100 nucleotides on each end and discard reads from the tails. The errorcorrection (Algorithm 1) is run with window size of 24, p -value of 0.001, and error rate (cid:15) = 0 . .
1, and 0 . p h < − to zero.We also test a simple alternative to the EM algorithm as follows. For each hap-lotype h ∈ H , let c h count how many reads haplotype h is consistent with and set p h = c h / (cid:80) l ∈H c l . This estimate will be correct under the given model if each read isconsistent with exactly one haplotype.For evaluating the performance of the various steps of our reconstruction method,we use several basic measures of performance. To measure the distance between two setsof haplotypes (one original and one inferred), we calculate how many of the originalhaplotypes are found among the inferred haplotypes as well as the average of thedistances between each inferred haplotype and its closest original haplotype (Fig. 5).Distance is measured as Hamming distance on the amino acid level. To compare twopopulations with different frequencies but the same haplotypes, we use the Kullback–Leibler (KL) divergence D KL ( p (cid:107) q ) = (cid:80) h ∈H p h log( p h /q h ), where p and q are the twodiscrete (haplotype) distributions with the same support H (Fig. 7). To measure theperformance of the entire process, we measure how much of the inferred population isclose to the original population. Specifically, we calculate the percent of the inferredpopulation that is within a specified distance from one of the original haplotypes (cf.Fig. 9). We refer to this statistic as ϕ n , where n is the number of amino acid differenceswe allow. 24 .5 HIV sequence data Virus populations derived from four treatment-experienced patients between 2000 and2005 were sequenced using both pyrosequencing and limiting dilution Sanger sequenc-ing. The plasma HIV-1 RNA levels in the four plasma samples were each greater than100,000 copies/ml as determined using the VERSANT HIV-1 RNA assay [46]. Eachsequence encompassed all 99 HIV-1 protease codons and the first 241 reverse transcrip-tase codons. The same genomic region of the same four samples was analyzed usinglimiting dilution and direct Sanger sequencing of the clones. Sample preparation andpyrosequencing and Sanger sequencing techniques are explained in detail in [5].Briefly, ultra-deep pyrosequencing was performed on four RT-PCR products fromRNA extracted from cryopreserved plasma samples. The median number of cDNAcopies prior to sequencing was 100 with an interquartile range of 75 to 180. The re-sulting datasets consisted of between 4854 and 7777 reads of average length 105 bp.Reads were error corrected (Algorithm 1) and translated to amino acids. For haplotypereconstruction, Algorithm 5 was run repeatedly until all or at most 10,000 candidatehaplotypes were found. The samples were translated into amino acids after the errorcorrection step; thus, the haplotype reconstruction and frequency estimation algorithmsare done on the amino acid level.For the sample with the greatest diversity (V11909), the unamplified cDNA productwas serially diluted prior to PCR amplification. Bidirectional sequencing was performeddirectly on 37 amplicons derived from the 1/30 cDNA dilutions and 31 amplicons de-rived from the 1/100 cDNA dilutions. Three sequences were discarded because of in-complete coverage. For the other three samples, we used the Sanger method to sequencea total of 32, 42, and 26 plasmid subclones per sample.Some of the sequences obtained from limiting dilutions contained mixtures of severalclones. In this case, in order to measure the Hamming distance between an inferredhaplotype and a clonal haplotype with ambiguous bases, we used the minimum distanceover all possible translations of the ambiguous haplotype.
Acknowledgments
N. Eriksson and L. Pachter were partially supported by the NSF (grants DMS-0603448and CCF-0347992, respectively). N. Beerenwinkel was funded by a grant from the Bill& Melinda Gates Foundation through the Grand Challenges in Global Health Initiative.
References
1. Fakhrai-Rad H, Pourmand N, Ronaghi M (2002) Pyrosequencing: an accurate detection platformfor single nucleotide polymorphisms. Hum Mutat 19:479–485. doi:10.1002/humu.10078. URL http://dx.doi.org/10.1002/humu.10078 .2. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing inmicrofabricated high-density picolitre reactors. Nature 437:376–380. doi:10.1038/nature03959.URL http://dx.doi.org/10.1038/nature03959 . . Malet I, Belnard M, Agut H, Cahour A (2002) From RNA to quasispecies: A DNA polymerasewith proofreading activity is highly recommended for accurate assessment of viral diversity. J VirolMethods 109:161–170.4. Huse S, Huber J, Morrison H, Sogin M, Welch D (2007) Accuracy and quality of massively-parallelDNA pyrosequencing. Genome Biology 8.5. Wang C, Mitsuya Y, Gharizadeh B, Ronaghi M, Shafer RW (2007) Characterization of muta-tion spectra with ultra-deep pyrosequencing: Application to HIV-1 drug resistance. Genome Res17:1195–1201.6. Eigen M, McCaskill J, Schuster P (1989) The molecular quasi-species. Adv Chem Phys 75:149–263.7. Domingo E, Holland JJ (1997) RNA virus mutations and fitness for survival. Annu Rev Microbiol51:151–178. doi:10.1146/annurev.micro.51.1.151. URL http://dx.doi.org/10.1146/annurev.micro.51.1.151 .8. Nowak MA, Anderson RM, McLean AR, Wolfs TF, Goudsmit J, et al. (1991) Antigenic diversitythresholds and the development of AIDS. Science 254:963–969.9. Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, et al. (1999) Consistent viralevolutionary changes associated with the progression of human immunodeficiency virus type 1infection. J Virol 73:10489–10502.10. Gaschen B, Taylor J, Yusim K, Foley B, Gao F, et al. (2002) Diversity considerations in HIV-1vaccine selection. Science 296:2354–2360. doi:10.1126/science.1070441. URL http://dx.doi.org/10.1126/science.1070441 .11. Douek DC, Kwong PD, Nabel GJ (2006) The rational design of an AIDS vaccine. Cell 124:677–681.doi:10.1016/j.cell.2006.02.005. URL http://dx.doi.org/10.1016/j.cell.2006.02.005 .12. Beerenwinkel N, Sing T, Lengauer T, Rahnenf¨uhrer J, Roomp K, et al. (2005) Computational meth-ods for the design of effective therapies against drug resistant HIV strains. Bioinformatics 21:3943–3950. doi:10.1093/bioinformatics/bti654. URL http://dx.doi.org/10.1093/bioinformatics/bti654 .13. Rhee SY, Liu TF, Holmes SP, Shafer RW (2007) HIV-1 subtype B protease and reverse transcrip-tase amino acid covariation. PLoS Comput Biol 3:e87. doi:10.1371/journal.pcbi.0030087. URL http://dx.doi.org/10.1371/journal.pcbi.0030087 .14. O’Meara D, Wilbe K, Leitner T, Hejdeman B, Albert J, et al. (2001) Monitoring resistance tohuman immunodeficiency virus type 1 protease inhibitors by pyrosequencing. J Clin Microbiol39:464–473. doi:10.1128/JCM.39.2.464-473.2001. URL http://dx.doi.org/10.1128/JCM.39.2.464-473.2001 .15. Simons J, Egholm M, Lanza J, Desany B, Turenchalk G, et al. (2005) Ultra-deep sequencing ofHIV from drug resistant patients. Antivir Ther 10:S157.16. Tsibris A, Russ C, Lee W, Paredes R, Arnaout R, et al. (2006) Detection and quantification ofminority HIV-1 env V3 loop sequences by ultra-deep sequencing: preliminary results. Antivir Ther11:S74.17. Roche Applied Sciences (2006) GS20 Data Processing Software Manual. Penzberg, Germany:Roche Diagostics GmbH.18. Bacheler L, Jeffrey S, Hanna G, D’Aquila R, Wallace L, et al. (2001) Genotypic correlates ofphenotypic resistance to efavirenz in virus isolates from patients failing nonnucleoside reversetranscriptase inhibitor therapy. J Virol 75:4999–5008. doi:10.1128/JVI.75.11.4999-5008.2001. URL http://dx.doi.org/10.1128/JVI.75.11.4999-5008.2001 .19. Johnson VA, Brun-Vezinet F, Clotet B, Kuritzkes DR, Pillay D, et al. (2006) Update of the drugresistance mutations in hiv-1: Fall 2006. Top HIV Med 14:125–130.20. Pevzner PA, Tang H, Waterman MS (2001) An Eulerian path approach to DNA fragment assembly.Proc Natl Acad Sci U S A 98:9748–9753. doi:10.1073/pnas.171285098. URL http://dx.doi.org/10.1073/pnas.171285098 .21. Tyson GW, Banfield JF (2005) Cultivating the uncultivated: a community genomics perspective.Trends Microbiol 13:411–415. doi:10.1016/j.tim.2005.07.003. URL http://dx.doi.org/10.1016/j.tim.2005.07.003 .22. Chen K, Pachter L (2005) Bioinformatics for whole-genome shotgun sequencing of microbial com-munities. PLoS Comput Biol 1:e24. doi:10.1371/journal.pcbi.0010024.
3. The International HapMap Consortium (2005) A haplotype map of the human genome. Nature437:1299–1320.24. Jenkins P, Lyngsø RB, Hein J (2006) How many transcripts does it take to reconstruct the splicegraph? In: Bucher P, Moret BME, editors, WABI. Springer, volume 4175 of
Lecture Notes inComputer Science , pp. 103–114.25. Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, et al. (2002) Genomic analysis ofuncultured marine viral communities. Proc Natl Acad Sci U S A 99:14250–14255. doi:10.1073/pnas.202488399. URL http://dx.doi.org/10.1073/pnas.202488399 .26. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in thedeep sea and the underexplored ”rare biosphere”. Proc Natl Acad Sci U S A 103:12115–12120.doi:10.1073/pnas.0605127103. URL http://dx.doi.org/10.1073/pnas.0605127103 .27. Kececioglu J, Ju J (2001) Separating repeats in DNA sequence assembly. In: RECOMB 2001:Proceedings of the fifth annual international conference on Computational biology. New York, NY,USA: ACM Press, pp. 176–183. doi:http://doi.acm.org/10.1145/369133.369192.28. Tammi MT, Arner E, Britton T, Andersson B (2002) Separation of nearly identical repeats inshotgun assemblies using defined nucleotide positions, DNPs. Bioinformatics 18:379–388.29. Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: a mathe-matical analysis. Genomics 2:231–239.30. Excoffier L, Slatkin M (1995) Maximum-likelihood estimation of molecular haplotype frequenciesin a diploid population. Mol Biol Evol 12:921–927.31. Stephens M, Smith NJ, Donnelly P (2001) A new statistical method for haplotype reconstructionfrom population data. Am J Hum Genet 68:978–989.32. Halperin E, Hazan E (2006) Haplofreq–estimating haplotype frequencies efficiently. J Comput Biol13:481–500. doi:10.1089/cmb.2006.13.481. URL http://dx.doi.org/10.1089/cmb.2006.13.481 .33. Schmid R, Schiuster SC, Steel MA, Huson DH (2007). Readsim – a simulator for Sanger and 454sequencing. submitted .34. Rhee SY, Gonzales MJ, Kantor R, Betts BJ, Ravela J, et al. (2003) Human immunodeficiencyvirus reverse transcriptase and protease sequence database. Nucleic Acids Res 31:298–303.35. Gonzales MJ, Johnson E, Dupnik KM, Imamichi T, Shafer RW (2003) Colinearity of reversetranscriptase inhibitor resistance mutations detected by population-based sequencing. J AcquirImmune Defic Syndr 34:398–402.36. Doukhan L, Delwart E (2001) Population genetic analysis of the protease locus of human im-munodeficiency virus type 1 quasispecies undergoing drug selection, using a denaturing gradient-heteroduplex tracking assay. J Virol 75:6729–6736. doi:10.1128/JVI.75.14.6729-6736.2001. URL http://dx.doi.org/10.1128/JVI.75.14.6729-6736.2001 .37. Metzner KJ, Rauch P, Walter H, Boesecke C, Zllner B, et al. (2005) Detection of minor populationsof drug-resistant HIV-1 in acute seroconverters. AIDS 19:1819–1825.38. Gottlieb GS, Nickle DC, Jensen MA, Wong KG, Grobler J, et al. (2004) Dual HIV-1 infection as-sociated with rapid disease progression. Lancet 363:619–622. doi:10.1016/S0140-6736(04)15596-7.URL http://dx.doi.org/10.1016/S0140-6736(04)15596-7 .39. Rambaut A, Posada D, Crandall KA, Holmes EC (2004) The causes and consequences of HIVevolution. Nat Rev Genet 5:52–61. doi:10.1038/nrg1246. URL http://dx.doi.org/10.1038/nrg1246 .40. Rouzine IM, Rodrigo A, Coffin JM (2001) Transition between stochastic evolution and deterministicevolution in the presence of selection: general theory and application to virology. Microbiol MolBiol Rev 65:151–185. doi:10.1128/MMBR.65.1.151-185.2001. URL http://dx.doi.org/10.1128/MMBR.65.1.151-185.2001 .41. Dilworth RP (1950) A decomposition theorem for partially ordered sets. Ann Math (2) 51:161–166.42. Ford LR, Fulkerson DR (1962) Flows in networks. Princeton University Press.43. Hopcroft JE, Karp RM (1973) An n / algorithm for maximum matchings in bipartite graphs.SIAM J Comput 2:225–231.44. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EMalgorithm (with discussions). J R Statist Soc B 39:1–38.45. Simpson EH (1949) Measurement of diversity. Nature 163:688.
6. Collins M, Irvine B, Tyner D, Fine E, Zayati C, et al. (1997) A branched DNA signal amplificationassay for quantification of nucleic acid targets below 100 molecules/ml. Nucleic Acids Research25:2979–2984. upporting Information Fig. S1.
Two different chain decompositions of the read graph for 1000 reads froma population of 5 haplotypes at 3% diversity. The bottom five lines correspond toreads matching a haplotype uniquely; the top to reads matching several haplotypes.One decomposition gets one haplotype entirely correct (top, black); the other gets twodifferent haplotypes essentially correct (bottom, green and yellow). In this way, takingmultiple chain decompositions allows us to reconstruct all haplotypes.29 l l l l l l l l
Number of constructed haplotypes A v e r age H a mm i ng d i s t an c e f r o m c o rr e c t l Diversity3%5%7% l l l l l l l l
Number of constructed haplotypes A v e r age H a mm i ng d i s t an c e f r o m c o rr e c t l Diversity3%5%7% (a) (b) l l l l l l l l
200 400 600 800 1000
Number of constructed haplotypes A v e r age H a mm i ng d i s t an c e f r o m c o rr e c t l Diversity3%5%7% (c)
Fig. S2.
Haplotype reconstruction. Up to 1000 candidate haplotypes were generatedusing Algorithm 5 from 10,000 error free reads drawn from populations of size 10, 20,and 50 (subfigures (a), (b), and (c)) at varying diversity levels. Displayed is a measureof the efficiency of haplotype reconstruction: the average Hamming distance (in aminoacids) between an original haplotype and its closest match among the reconstructedhaplotypes. 30 lllllllllllllllllllllllllllllllllllllllllllllllll . . . . . Haplotypes E rr o r s pe r k b Fig. S3.
Resulting error after error correction on populations with 4% diversity. Pop-ulations with up to 50 haplotypes of equal frequency were created.
ReadSim was usedto simulate pyrosequencing with an error rate of 3–6 errors per kb (after alignment).Error correction sucessfully reduced the error rate by a factor of approximately 30.31 l l l l l l l l
Reads M i n i m a l nu m be r o f e x p l a i n i ng hap l o t y pe s l Errors/kb00.10.2