Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons
11 Reevaluating Assembly Evaluations with Feature ResponseCurves: GAGE and Assemblathons
Francesco Vezzi , ∗ , Giuseppe Narzisi , Bud Mishra , , School of Computer Science and Communication, KTH Royal Institute of Technology, Science for LifeLaboratory, 17121 Solna, Sweden Cold Spring Harbor Laboratory, Cold Spring Harbor, United States of America Courant Institute of Mathematical Sciences, New York University, New York, United States ofAmerica NYU School of Medicine, New York University, New York, United States of America ∗ E-mail: Corresponding [email protected]
Abstract
In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolu-tionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA sequences/reads. However, the performance of each ofthese tools is contingent on the length and quality of the sequencing data, the structure and complexityof the genome sequence, and the resolution and quality of long-range information. Furthermore, in theabsence of any metric that captures the most fundamental “ features ” of a high-quality assembly, there isno obvious recipe for users to select the most desirable assembler/assembly. This situation has promptedthe scientific community to rely on crowd-sourcing through international competitions, such as Assem-blathons or GAGE, with the intention of identifying the best assembler(s) and their features. Some whatcircuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on theavailability of a high-quality fully assembled reference genome sequence. Still worse, reference-guidedevaluations are often both difficult to analyze, leading to conclusions that are difficult to interpret. Inthis paper, we circumvent many of these issues by relying upon a tool, dubbed FRC bam , which is ca-pable of evaluating de novo assemblies from the read-layouts even when no reference exists. We extendthe FRCurve approach to cases where lay-out information may have been obscured, as is true in manydeBruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a muchwider class of assemblers – thus, identifying higher-quality members of this group, their inter-relations aswell as sensitivity to carefully selected features , with or without the support of a reference sequence or lay-out for the reads. The paper concludes by reevaluating several recently conducted assembly competitionsand the datasets that have resulted from them.
Introduction
The extraordinary advances in Next Generation Sequencing (NGS) technologies over the last ten yearshave triggered an exponential drop in sequencing cost, thus making it possible to perform whole-genomeshotgun (WGS) sequencing of almost every organism in the biosphere. In particular, recent WGS projectsare distinctive by the way they have facilitated whole genome sequencing at a high coverage ( i.e. , higherthan 50 × ), albeit, composed of relatively short sequences ( i.e. , reads).Despite this impressive progress, recent efforts have underlined the difficulties in trading-off read lengthagainst read coverage. It is now well recognized how the short reads have made the assembly problemsignificantly harder [1] owing to the complexity involved in resolving ( i.e. span over) long repeats.Nonetheless, this challenge has been confronted recently with sophisticated and novel techniques,embedded in a diverse set of tools all aiming to solve de novo assembly problem. Such tools ( i.e. , assemblers ) are based on the simple assumption that if two reads share a sufficiently long subsequence a r X i v : . [ q - b i o . GN ] O c t then they are likely to belong to the same location in the genome. In order to represent and efficientlyuse such information for myriads of short reads, assemblers typically rely on compressed graph structures(often de-Bruijn graphs but also string-graphs). Moreover, additional heuristics are employed for errorcorrection and read-culling.More than twenty different assemblers have been designed to tame the computational complexity ofassembling NGS reads, with the vast majority of them specifically targeting Illumina reads. One of themain consequences of this proliferation in software production is the difficulty in selecting one assemblerover another, which often makes a Buridan’s ass of a bioinformatics researcher: Their effort spent onselecting the best assembler ( i.e. , the largest haystack for the ass) ultimately diverts them from theirultimate objective of answering biological questions ( i.e. , leading to a confused and starving ass).Adding to the confusion, every new genome presents its own sets of problems, e.g , ploidy, heterozigos-ity, repetitive structures, etc. . The available assemblers usually are able to efficiently solve only someof these problems or are specifically designed for limited datasets ( e.g. , bacterial genomes). A widelyfollowed approach is to use multiple assemblers, run with different parameters, producing statistics thatcould point to the best among them. However, no clear way to select the “best” assembler has yetmade itself obvious. As noticed by Miller in [2] all new published assemblers have been compared tothe then-existing tools showing, every time, their better performances on a specific dataset and on somespecific metrics. More often than not, only traditional metrics ( i.e. , contiguity-based metrics) are used incomparing assemblers’ performances ( e.g. , number of contigs, NG50, etc. ) – a strategy that suffers fromthe drawback of emphasizing only assembly size. Moreover, in [3], NG50 (the most “abused” metric)has been demonstrated to be a bad assembly quality predictor. In contrast, more reliable results can beproduced, when a reference sequence is available, since contigs could be aligned against it in order tojudge the number of errors ( i.e. , reference-based metrics). Unfortunately, currently, no effort is usuallymade in weighting or scoring qualitatively different types of errors, thus reducing this approach to asimple error counting without accounting for subtle differences among the different types of errors.More recently, the focus has shifted from seeking just contiguity to assembly precision. An earlierstudy [4] showed that in the published and revised human genome [5] on average 10% of assembledfragments were assigned the wrong orientation and 15% of fragments, placed in a wrong order. Recallthat this draft sequence of the Human Genome [5], which was released in 2001, had taken several largeteams more than five years to finish and validate (but only at a genotypic level). With many projectsleft at draft level, NGS technologies have worsened this situation even further. Alkan in [6] criticizedtwo of the majors NGS achievements: the assembly of the Han Chinese and Yoruban individuals [7] bothsequenced with Illumina reads. Alkan identified 420 Mbp of missing repeat sequences from the Yorubanassembly, and estimated that in both assemblies almost 16% of the genome was missing.Despite these widely discussed and obvious problems, there still persists a lack of standard proceduresand methods to validate and evaluate assemblies. Several projects have been initiated to explore theparameter space of the assembly problem, in particular in the context of short read sequencing [6, 8, 9].Recently, a growing number of studies have aimed at independently evaluating different assemblers orassembly pipelines. Assemblathon 1 [10] and
Assemblathon 2 sought to assess assemblers’ performanceson common datasets encouraging a competition among researchers/users and assemblers’ developers. Inits earliest version, the competition was performed on a simulated dataset, leaving open to a criticismof the effectiveness of its genome and read simulators [3]. Assemblathon 1’s entries were evaluated andranked, based on a mixture of contiguity-based and reference-based metrics. The final result is a largetable (see Table 3 in [10]) in which some assemblers perform well on some metrics while behaving poorlyon others, thus, leaving its interpretation somewhat equivocal.A similar but independent study, dubbed GAGE, has been designed to critically evaluate and compareassemblers on four different large-scale NGS projects [11]. The presence of an already assembled referencesequence for three of the studied genomes allowed the authors to assess assembly quality. One of themain message of this study is that the same assembler can produce utterly different qualities of resultson different datasets. Moreover, Salzberg and colleagues showed how assemblers’ performance is affectedby data quality: preprocessing used in read correction seems fundamental to improve assemblers’ results.The main conclusion of this study is that there is no universal “assembly recipe” to be used for assemblingnew genomes. An assembler working well on certain genomes may exhibit drastically poorer performancewhen used to assemble even a fairly similar genome. A fundamental criticism against GAGE is how itselects the “best” assembly for each assembler and for each dataset: GAGE’s authors chose the assemblywith the largest NG50, thus building on an extremal statistic which, as mentioned earlier, also happensto be the worst quality predictor [3]. As in Assemblathon 1, GAGE output is presented as a set of tableswith massive amount of—often hard-to-interpret—information.This state of affairs is not completely surprising, given the complexity of assembly evaluation, es-pecially, when all errors cannot be substantially eliminated. For instance, even after six months sinceAssemblathon 2’s competition, an official ranking remains undisseminated (except for the one based onNG50).Recently, Narzisi and Mishra in [12] proposed a new metric, Feature Response Curve (FRCurve),capable of capturing the trade-off between contig contiguity and correctness. FRCurve is based on theprinciple that the assembly precision can be predicted by identifying on each contig a set of suspiciousregions ( i.e. , features ): contigs are then sorted from the longest to the shortest, and for each featurethreshold δ only the longest contigs whose total sum of features is less than δ are used to compute thegenome coverage ( i.e. , a single point in the FRCurve). Such technique has been extensively studiedand evaluated in [3]. Despite its power the main limitation of FRCurve is that it requires the so-called read layout , a standard output of Sanger-based assemblers, but missing in the vast majority of NGSassemblers. Such dependency restricts FRCurve analysis tools to only OLC, overlap-layout-consensusbased assemblers and thus to a limited subset of NGS-based studies.In this paper, we present an enhanced tool, named FRC bam , capable of computing FRCurve fromthe alignment of the reads to the assembled contigs. In particular, we show that this method is able tocorrectly and rigorously evaluate assemblers’ performance and precision, even in the absence of a referencesequence, while using a broad set of metrics, not just those based on assembly contiguity. We begin bydescribing the set of implemented features, and then evaluate our tool on the datasets used in the threemajor assembly evaluations efforts: GAGE, Assemblathon 1 and Assemblathon 2. Materials and Methods
Almost always, de novo assembly is carried out using more than one library. In the Illumina scenario wetypically have at least two libraries: one paired-end library (PE), and one mated-pair library (MP). Theformer provides paired reads in the standard orientation ( → ← ) with insert size that can vary between150 bp (overlapping fragments) and 1000 bp (standard PE). The latter yields pairs of sequences in theopposite direction ( ← → ) and the insert size is much longer (usually in the range between 3 and 10 Kbp).Due to the different cost of the two protocols a typical sequencing project consists of one high coveragePE library and one low coverage MP library. The main advantage of MP reads is to improve contiguitythrough scaffolding and gap-filling procedures. However, the MP library is intrinsically more difficult toobtain than standard PE libraries and are usually affected by redundancy (PCR duplicates) and unevengenome representation.After PE reads and MP reads are aligned against the assembly itself, the ordered and indexed BAMfiles will be the input of FRC bam . FRC bam needs at least one PE library and, if available, one MP library.The user needs to provide a rough estimation of the insert size and of the standard deviation for bothlibraries and an estimation of the genome length. Read coverage and spanning coverage are computeddirectly from the BAM files.Several features are computed in order to identify problems related to read coverage, mate pair hap-piness [8], and compression/expansion events ( i.e. , CE-statistics) [13]. As a consequence of their differentnature, PE reads and MP reads are used to compute two different sets of features. The former is usedto compute the following features:
LOW_COV_PE , HIGH_COV_PE , LOW_NORM_COV_PE , HIGH_NORM_COV_PE , COMPR_PE , STRECH_PE , HIGH_SINGLE_PE , HIGH_SPAN_PE , and
HIGH_OUTIE_PE . The latter library is usedto compute only a subset of the features, similar to the ones in the previous set:
COMPR_MP , STRECH_MP , HIGH_SINGLE_MP , HIGH_SPAN_MP , and
HIGH_OUTIE_MP . The main difference is due to the fact that MPreads usually provide a low read coverage ( i.e. vertical) but produce a high spanning coverage ( i.e. hor-izontal). Therefore MP reads are best used to compute features related to long range information (seeTable 1 and Supplementary material for a detailed description of features).FRC bam output consists of several files: ( a ) the FRCurve itself (to be plotted), ( b ) the FRCurves foreach individual feature, and finally, ( c ) a position-by-position description of the feature (in GFF format).This last file holds for each contig the identified features, together with the start and end points. Datasets
For comparative analysis of NGS assemblers, both GAGE and Assemblathon studies offer state-of-the-art datasets, which could also be re-purposed to evaluate reliability of the new FRC bam . These datasetswere of particular interest to us for several reasons, falling into three categories: ( i ) datasets consistof state-of-the-art sequences, with reads often belonging to several paired-end and mate-pairs libraries;( ii ) availability of already “optimized” assemblies; ( iii ) presence of a reference sequence for most of thesequenced organism.The first category allowed us to test FRC bam against state-of-the-art datasets and to take advantageof different insert-types. The second category enabled us to use assemblies that may be considered as the“best” achievable, since they were obtained by de novo assembly experts ( i.e. , GAGE) or by the sameassemblers’ developers ( i.e. , Assemblathon). Specifically, the availability of a reference sequence, allowus to measure assemblies’ correctness, thus also demonstrating how FRC bam and the computed featuresare able to effectively gauge assembly accuracies and to identify suspicious regions ( i.e. , mis-assemblies).In total, we tested FRC bam on five datasets: Staphilococcus aureus , Rhodobacter sphaerodis , andHuman chromosome 14 from GAGE, data of simulated genomes from Assemblathon 1 competition, and
Boa constrictor ( i.e. , Snake) from Assemblathon 2 competition. All five datasets are composed of highcoverage ( i.e. , all exceeding 40 × ) Illumina paired-end and mate-pair reads libraries. S. aureus has beenassembled with 7 different assemblers (see Table 2),
R. sphaerodis and Human chromosome 14 (hereafterHc14) have been assembled with 8 different assemblers (see Tables 3 and 4). Assemblathon 1 andAssemblathon 2 comprise 59 and 12 entries respectively. The large number of Assemblathon 1 entries issimply a consequence of the rule to permit multiple submissions: we decided to download only the bestentry from each team, as determined by the Assemblathon 1 ranking (refer to [10] for more details), fora total of 17 entries. Summarizing, we tested FRC bam on five extremely different datasets for a total of43 assemblies.For each dataset we selected one paired-end library and one mate-pair library (see Supplementarymaterial for more details). These two libraries were then aligned against the available assemblies us-ing rNA [3]. We aligned reads using also BWA [14] without detecting any noticeable difference (seeSupplementary material).Using libraries with different insert sizes ( i.e. , paired-end and mate-pair reads) enabled us to identifydifferent features types. On the one hand, paired-end reads, characterized by a short insert size ( i.e. ,usually less than 600 bp) are able to highlight local mis-assemblies and relatively small insertions/deletionsevents. On the other hand, mate-pairs, characterized by a larger insert size ( i.e. , usually more than 2Kbp) are able to highlight larger insertion/deletion events and larger mis-assemblies ( e.g. , scaffoldingerrors).
Results
Figure 1 shows FRCurves for the three GAGE genomes (
S. aureus
Figure 1(a),
R. spheroides
Figure 3,and Hc14 Figure 1(c)) and for Assemblathon 1 entries (Figure 1(d)). For each of the analyzed assemblieswe aligned contigs against the reference genome. To accomplish this task we employed the scripts availableon GAGE website [11]. Assembly statistics are reported in Tables 2, 3, 4, and 5.The four tables (Tables 2, 3, 4, and 5) report for each assembly/assembler the number of con-tigs/scaffolds produced (Ctg), the NG50, the percentage of short ( i.e. less than 200 bp) contigs, thepercentage of duplicated (Dupl) and compressed (Comp) regions in the assembly (all the percentages arecomputed with respect to the real genome length), the number of long ( i.e. , > dnadiff [8]we could identify regions of real mis-assemblies, thus enabling us to compute sensitivity and specificityof our features. Note that sensitivity is defined as the ratio between true positives ( i.e. , positions markedas mis-assembled by dnadiff and labelled by one or more features), and its sum with false negatives ( i.e. ,positions marked as mis-assembled by dnadiff but not labelled by any feature). Specificity, instead, isthe ratio between true negatives ( i.e. , positions not marked as mis-assembled by dnadiff and not labelledby any feature) and its sum with false positives ( i.e. , positions not marked as mis-assembled by dnadiff but labelled by one or more features). The first measure enables FRCurve to identify problematic areas,while the latter measure distinguishes non -problematic from problematic regions ( e.g. , if a feature marksall position in an assembly the sensitivity will be 1, however the specificity is likely to be close to 0). GAGE
Figure 1(a) and Table 2 show the FRCurve and the reference guided validation of
S. aureus
GAGE’sdataset respectively. From Figure 1(a) MSR-CA and Allpaths-LG appear to be the best performingassemblers on such datasets ( i.e. , the sharpest curves). These two assemblers are closely followed bySOAPdenovo, Velvet, and Bambus2, while SGA and ABySS clearly show bad performance. Both sen-sitivity and specificity of reported features are high (last two columns of Table 2), thus demonstratingthat FRC bam (and therefore our features) is able to correctly identify suspicious regions. Specificityis not particularly high only for ABySS and SGA. However, in these two assemblies the percentage ofmis-assembled sequences identified by dnadiff are 20% and 8%, respectively, suggesting a high numberof problematic regions close to the real mis-assembly sequences.Some remarks are warranted on the stepwise shape of some curves ( e.g. , MSR-CA, Allpaths-LG andBambus2). Such a shape indicates the presence of contigs with a large number of features that interruptsa smooth growth of the curve, which is particularly discernible when the number of contigs is low. Asan example, consider the longest MSR-CA contig containing almost half of the features identified inthe entire assembly. The high sensitivity and specificity reported in Table 2 show that these featuresrepresent truly problematic regions. Let us focus on Allpaths-LG and MSR-CA: in Figure 2 we present thealignment of the longest scaffold produced by Allpaths-LG and MSR-CA against the reference genome.From Figure 2(b) it is clear that the stepwise shape of MSR-CA’s FRCurve is a consequence of wrongchoices made by the assembler. The situation is different in the Allpaths-LG case: Figure 2(a) shows acorrectly reconstructed scaffold, therefore there is apparently no reason to justify the stepwise curve ofAllpaths-LG. Puzzled by this anomaly, we plotted the FRCurve for each single feature (see Supplementarymaterial). With this analysis, we discovered that Allpaths-LG has the best curve in the majority of thecases. However, there are two exceptions:
STRECH_MP and
COMPR_MP features, which are representativeof compression or expansion events. Areas characterized by these features coincide with the circles inthe dotplot (see Figure 2(a)): these areas involve small mis-joins ( i.e. , less than 50 bases) or scaffoldjunctions ( i.e. , sequences of Ns). A likely explanation is that such small mis-joins are able to “attract”reads that are responsible for the features. Moreover,
STRECH_MP and
COMPR MP features depend on CEstatistics [13] and therefore on the choice of two thresholds, often estimated sub-optimally — note that,despite the availability of a reference sequence, these thresholds were estimated without it. MSR-CA isthe assembly characterized by the largest number of areas composed of large numbers of mis-orientedmate/paired reads ( i.e. , HIGH_OUTIE_PE and
HIGH_OUTIE_MP ), as a consequence of the large inversionsand translocations present in the first scaffold. Other hints about MSR-CA’s problems come from theFRCurve obtained from the contigs (see Supplementary material): MSR-CA’s FRCurve is not as goodas those of Allpaths-LG, SOAPdenovo and Bambus2.This situation demonstrates that assembly evaluation is extremely difficult. With the help of areference sequence it is clear that MSR-CA suffers from a large number of errors (see Table 2). However, inits absence, many users might have chosen MSR-CA over others, since it seemed to be able to reconstructalmost the whole genome with a single scaffold. FRCurve, without the use of a reference, was ableto raise doubts about MSR-CA ( i.e. , the only assembler with a high number of
HIGH_OUTIE_PE and
HIGH_OUTIE_MP features), thus suggesting a more careful manual validation on Allpaths-LG.According to the FRCurve analysis, SGA (together with ABySS) is one of the worst performingassemblers. Although GAGE analysis concludes that SGA introduces relatively fewer errors, it is also themost fragmented one, consisting of 456 scaffolds (and 1252 contigs). This kind of assemblies, despite itslow error-rate, tends to accumulate features related to copy number variation problems ( e.g. , LOW_COV_PE )and features like
HIGH_SPAN_MP suggesting problems in the scaffolding ( i.e. , either errors in the scaffoldingor a failure in establishing contig connections).Similar analyses can be carried out for
R. sphaeroides and Hc14 datasets whose FRCurves are repre-sented in Figures 1(b) and 1(c).In
R. sphaeroides dataset Allpaths-LG and MSR-CA again appear to be the two best performingassemblers, though SOAPdenovo, Velvet, and Bambus2 are not too far behind. The longest Allpaths-LG scaffold practically reconstructs the longest Rhodobacter chromosome: such scaffold contains only100 features most of them suggesting the presence of regions affected by low paired read coverage ( i.e. , LOW_NORM_COV_PE and
LOW_COV_PE ). Such features affect all others assemblers similarly. From FRCurveanalysis one may conclude that Allpaths-LG is the best performing tool. The alignments of Allpaths-LGassembly against the reference further confirm this conclusion (see Supplementary material).Bambus2 is characterized by a long (correct) scaffold that contains almost one third of its fea-tures. This situation is a consequence of regions composed of a large number of singleton reads ( e.g. , HIGH_SINGLE_MP ) and of areas suggesting the presence of compression events ( e.g. , COMPR_MP ). Similarlyto the analysis of the
S. aureus , these features seem to coincide with small gaps (as the alignment of thelongest Bambus2 scaffold against the reference sequence shows, see Supplementary material).From Figure 1(b) CABOG appears not to be a very well performing assembler. Such situation isconfirmed by Figure 3 that shows the dotplot for CABOG’s longest scaffolds. The green columns at thebottom of the dotplot indicate the position where one or more features have been found by FRC bam .This plot shows how features are able to highlight problematic regions in the assembly, as the majorityof them coincide with the mis-assemblies.In the Hc14 case (see Figure 1(c)) Allpaths-LG and CABOG are clearly the best two assemblers.Allpaths-LG is the only assembler able to assemble almost all the sequences in a single scaffold contain-ing, practically, all the features. The total number of features identified on this long scaffold is lowerthan the total amount of features identified in the 400 longest CABOG scaffolds. When we consider theFRCurves for each individual feature (see Supplementary material), we notice that Allpaths-LG longestcontig is characterized by a large number of features suggesting coverage problems ( e.g , LOW_NORM_COV_PE , LOW_COV_PE , HIGH_NORM_COV_PE , and
HIGH_COV_PE features) and mated/paired read orientation prob-lems ( e.g. , HIGH_OUTIE_PE , and
HIGH_OUTIE_MP features). As far as the coverage features are concerned,Allpaths-LG has almost always a lower number of such features than the other assemblers. Moreover,
LOW_NORM_COV_PE feature is often the consequence of Allpaths-LG’s ability to correctly resolve repeatedregions (pairs are not correctly aligned as a consequence of a repeat, see Supplementary material). Lessstraightforward is the explanation for the large number of features suggesting the presence of a largenumber of mis-oriented pairs (in this case Allpaths-LG being one of the worst assemblers). Such featuresare indicative of inversions and insertions events, although the dotplot shows an almost contiguous scaf-fold that reconstructs the Chromosome 14 without any particular problem (see Supplementary material).After a closer inspection, we discovered that such long scaffold is affected by a large number of smallmis-joins as suggested by the circles in the main dotplot diagonal (see Supplementary material). Wetested 10 different areas subject to such mis-joins and in all cases we discovered either a scaffold joint istoo large or a scaffold joint has a short chimeric sequence in the middle, thus explaining the presence ofa feature. The presence of these small mis-joins has been reported also in GAGE analysis: in the Hc14dataset, the NG50 was close to 81 Mbp while the corrected-NG50 was 20 times shorter (the corrected-NG50 is the NG50 computed after breaking contigs at mis-assembled positions identified by the referencesequence). The low number of compression/expansion features ( i.e. , CE statistics), as well as the lownumber of high-spanning and high-single reads related features in Allpaths-LG assembly (see FRCurvesplots in Supplementary material) suggests that Allpaths-LG is able to return an assembly that is highlyand correctly connected. However, the relatively large number of paired-end related features suggests thepresence of small local mis-assemblies. On the other hand, CABOG produced a more fragmented assem-bly characterized by a small number of features. CABOG’s most frequent features ( i.e. , HIGH_SPAN_PE and
HIGH_SPAN_MP ) suggest a systematic failure during the scaffolding phase in correctly merging contigsand inferring their order.From FRCurve analysis alone, it is much harder to decide between the top two assemblers: Allpaths-LG and CABOG, though when the reference sequence is available, it is evident that Allpaths-LG suffersless from errors than CABOG (see Table 4). When considering only contigs (see Supplementary material)CABOG and Allpaths-LG still outperform other assemblers, as clearly proved by GAGE analysis (longestNG50).With almost 30,000 features MSR-CA is the third ranking assembler as determined by the FRCurveanalysis. MSR-CA is closely followed by SOAPdenovo and SGA. It is again difficult to fully ascertainsuch ranking, and even the reference guided validation in Table 4 does not lead to a clear and conclusiveopinion. The majority of SGA’s features are a consequence of the highly fragmented assembly (see
HIGH_SINGLE_MP
FRCurve in Supplementary material). However the small number of errors (see Table4) demonstrates that the final sequences are correct. SOAPdenovo is slightly better than MSR-CA asfar as the number of errors is concerned, notwithstanding the fact that SOAPdenovo is more fragmentedthan MSR-CA. SOAPdenovo is particularly affected by the presence of mis-oriented paired reads ( i.e. , HIGH_OUTIE_PE feature).In all the three GAGE datasets the sensitivity of the FRC bam is almost always higher than 90%(CABOG is an exception, but it must be noted that the percentage of mis-assembled sequences is less than1.4% of the genome length). Specificity is in general high, with the exception of assemblies characterizedby high errors rates ( e.g. , more than 40% of Velvet assembly is marked as suspicious by dnadiff on Hc14).
Assemblathon 1
Assemblathon 1 dataset differs from that of GAGE mainly in two ways: it is much larger and it is obtainedsolely by simulation. Figure 1(d) and Table 5 summarize the analysis performed on such datasets. It is ofparticular interest to compare FRCurve assembly evaluation with Assemblathon 1 paper evaluation [10].The order of the entries in Table 5 and of the legend in Figures 1(d) follows the Assemblathon 1 ranking.Despite the presence of some outliers, the FRCurve analysis is close to the ranking obtained by Earl et al. . BGI, WTSI-S, DOEGI, and CSHL were found by the FRCurve analysis to be better performingassemblers. They, together with Broad Institute’s ( i.e. , Allpaths-LG), were the five best assemblers ac-cording to Assemblathon 1 ranking. A similar analysis could determine the worst performing assemblers.CIUoC, GACWT, UCSF, ASTR, and IRISA are clearly characterized by undesirable FRCurves (CIUoC’slong contigs contain few errors, even though the assembly contains only a fraction of the whole genomeand small contigs contain many features).There are some clear differences, for example Broad’s Allpaths-LG, the best assembler in Assem-blathon 1 ranking is clearly among the best ones also in our FRCurve-based analysis, but has a highnumber of features suggesting problems with paired reads ( i.e. , LOW_NORM_COV_PE and
HIGH_SPAN_PE features). We discovered that these two features are highly correlated: in all the analyzed cases wediscovered the presence of a small contig perfectly (or almost perfectly) aligning against a larger contig,probably the result of a wrong copy number estimation or of an unresolved allele splitting event. Thisobservation is consistent with the analysis by Eearl et al. , as, for instance, Broad’s entry ranks 11 th forcopy number statistics.Another clear difference is CRACS, the 6 th ranking assembler in Assemblathon 1 evaluation, but anaverage performing assembler according to FRCurve analysis. The poor performance of this assembler isobserved in a series of long contigs all exhibiting an extremely high coverage ( i.e. , HIGH_COV_PE ). Thisis clearly reflected also in the ranking given by Assemblathon 1: CRACS has clear problems in inferringcopy number variation (12 th ranking tool) and it reconstructs only 96% of the genome (14 th rankingtool). FRCurve analysis suggests two possible solutions: either discard contigs strongly affected by thisfeature, or have CRACS developers reimplement an improved copy number variation estimation.The last two assemblers we considered are RHUL and IoBUGA. Also in this case, these assemblers haveFRCurves comparable to the best assemblers, but have been ranked below the median in Assemblathon-1’sevaluation. According to Assemblathon-1’s evaluation, RHUL has an acceptable number of substitutions(5 th ranking tool); it is able to assemble sequences in the right copy number (5 th ranking tool); and itis able to reconstruct (cover) the large part of the reference (4 th ranking tool). However, it lacks goodconnectivity (13 th ranking tool). FRCurve shows this assembler to contain most of its features in thelongest scaffolds, while the short ones contain a small number of features. Note that the longest of RHUL’s scaffolds generates a curve similar to ASTR’s. IoBUGA offers a similar story. Assemblathon-1’s rankingis difficult to interpret (15 th ranking tool for substitutions and gene coverage but 3 rd ranking tool forcopy number variation). This situation reemphasizes that reference guided validations are extremelydifficult to interpret, especially when a tool exhibits contradicting performance. It should also be pointedout that IoBuga has the lowest sensitivity (see Table 5). It is clear that new features may be addedin order to improve the effectiveness of FRC bam and FRCurve analysis. In this case, the availability ofRNA-seq data may allow design of new features, capable of capturing assemblers’ ability to reconstructgene expressions, splicing variants and intron-exon boundaries. Assemblathon 2
As shown earlier, the GAGE datasets were sufficient for testing the performance of FRC bam using onlyrelatively small datasets. But with access to reference sequences, some of the limitations of the analysisbecame evident: only
S. aureus and
R. spaeroides are realistic datasets, while Hc14 has been partiallysimulated (reads have been aligned and extracted, see [11] for more details). Moreover,
S. aureus and
R. spaeroides datasets are extremely small in size and, to some extent, represent fairly easy-to-assemblegenomes ( i.e. , no heterozygosity or high ploidy). With access to Assemblathon 1 data, we further testedthe FRC bam against a larger dataset that was previously analyzed and ranked. The main limitation ofthis dataset stemmed from the use of simulated reads, which often diverged from any reasonable modelof reality.In order to show the applicability of our method to larger sequencing projects we tested the FRC bam on all Assemblathon 2 entries for the Snake dataset (
Boa constrictor ). Results are shown in Figure 4.Surprisingly, when all features are considered all together, their FRCurves coincide closely with each other(see Figure 4(a)) suggesting that Assemblathon 2 participants, or the tools used by them, are convergingto common results. We can identify two teams (assemblers) that are doing better than the others: SGAand Meraculous. There is a dense conglomerate of similarly behaving assemblers consisting of ABySS,Phusion, SOAPdenovo, CRACS, and Ray. Other assemblers appear less promising, though, except forthe sole example of PRICE, none of them show unacceptably bad performance. The good performanceof CRACS on this dataset brings to mind how drastically differently the same assembler could behave ondifferent datasets.Results are different if we concentrate on one feature at a time (see Figure 4(b) and Supplementarymaterial). As an example, by inspecting the plot for the
HIGH_SPAN_PE feature, we observe that GAMoutperforms all the other assemblers. Meraculous and SGA show good performance too, together withCurtain, Symbiose, and BCM-HGSC.
HIGH_SPAN_PE feature indicates presence of mis-joins, as oftenpresumably close-by pairs are found in different contigs/scaffolds.Particularly interesting is the FRCurve plot describing the presence of areas composed mainly ofsingle ended reads ( i.e , HIGH_SINGLE_MP feature, see Supplementary material). All assemblers are stronglyaffected by this feature demonstrating a general failure of all tools. A likely explanation is in a systematicfailure in correctly assembling heterozygous loci , which generates holes in the assemblies, thus confoundingthe assemblers attempting to place both reads of mate-pairs. Note that this behavior is not present inthe
HIGH_SINGLE_PE features. A feature like
HIGH_SINGLE_MP is clearly not informative in this datasetand may be ignored without affecting the analysis.
Discussion
Limitations of de novo assembly evaluation
The rapidly growing set of new assemblers aims to address the need for assembly tools capable of handlingthe vast amount of data produced by NGS ( e.g.
Illumina) sequencers. This growth in data and tools,however, has led to another unmet need: a rigorous comparative study of these assemblers, which so farhas only been carried out in a rather na¨ıve way. Developers have focused more on performance ( e.g. ,RAM and CPU time) and connectivity ( e.g. , contig number and NG50) rather than on correctness.A commonly employed approach, currently being used to validate and gauge assemblies, is basedon a plethora of standard validation metrics . We can identify four main groups: length-base statistics,reference-based statistics, simulation-based statistics, and long-range-information (LRI) based statistics.Length-based statistics take into account only the size of the assembler output. These statisticscomprise mean contig length, maximum contig length, and NG50. NG50, in principle, gives an ideaof assemblies’ connectivity level. All length-based statistics are not linked to assembly correctness andemphasize only length: an assembler that eagerly merges together contigs can produce assemblies char-acterized by a large NG50 and by few long contigs. However, these long contigs are of no use if theycontain too many misassemblies. NG50 has been shown in [3] to be a bad quality predictor. Never-theless, length-based statistics are the basic, and some times the only, method used to judge assemblersperformances, especially when the assembly tools are new [15, 16].Assembly analysis would trivialize if the genome to be assembled was already available, which wouldmake it possible to compare assemblers using only the reference-based statistics. The strategy would beto resequence an organism with an already available fully finished whole genome reference sequence. Thisapproach would enable comparing assemblers from the computed real number of errors. The underlyingpremise is that good performances of an assembler on one dataset should reflect behavior on a widerrange of datasets. However, studies like GAGE has shown that the same assembler can produce utterlydifferent results on different genomes and different datasets — thus dashing any hope of generalizing theperformance of a tool on the basis of a single dataset. Moreover, reference-metrics are in general difficultto interpret or, at least, are open to several interpretations: as an example reference-based metrics havebeen used both to demonstrate the high quality assembly of two human individuals in [7] as well as todemonstrate the opposite (their poor quality) in [6].Simulation-based statistics face even more extreme hurdles: reads are simulated from a referencesequence and subsequently assembled. Vezzi et al. showed in [3] that simulated reads are likely to produceunrealistic contigs that cannot be used to judge assemblers’ performance. Despite these shortcomings,0competitions like Assemblathon 1 have continued to use a simulation-based approach.A more reasonable way to assess assembly correctness consists in the use of long range information.Second Generation Technologies are able to produce mate-pairs , that are pairs of reads at a mean distanceof 2 − i.e. , mate-pair happiness). Other two commonly used LRI-methods are physical maps [17]and optical maps [18, 19]. Both rely on the relative locations of different genes and other DNA sequencesof interest in the genome. Third Generation Sequencing Technologies (also known as Single MoleculeSequencing Technologies) and dilution-based sub-genomic sampling can also be used in the near futureto estimate assembly correctness. The main drawback of LRI statistics is the fact that they requirethe production of new and often expensive data. Moreover, apart from the simple counting, it remainsunclear how such information should be used to rank different assemblies that currently exist. FRCurve
The aim of this work is to present a new simple tool able to accurately evaluate assemblies and as-semblers’ performance even in the absence of a reference sequence. Features have been first introducedin [8] to identify possible mis-assemblies. Narzisi and Mishra [9] used such features to compute the socalled Feature Response Curve (FRCurve). FRCurve is closely connected to the standard receiver op-erating characteristic (ROC) curve: the Feature-Response curve characterizes the sensitivity (coverage)of the sequence assembler output (contigs) as a function of its discrimination threshold (number of fea-tures/errors). Given a set of features, the response (quality) of the assembler output is then analyzed asa function of the maximum number of possible errors (features) allowed in the contigs. More specifically,for a fixed feature threshold τ , the contigs are sorted by size and, starting from the longest, only thosecontigs are tallied, if their sum of features is less than τ . For this set of contigs, the correspondingapproximate genome coverage is computed, leading to a single point of the Feature-Response curve.Vezzi et al. [3] analyzed Feature space using multivariate techniques ( i.e. , PCA and ICA) in order tostudy features’ interactions and to use these to select the most important ones. Such study, however,highlighted one of the main weak points of FRCurve: the need of a layout file, that is, a file describing thepositions and orientations of each read (and therefore, each pair). While this file had been standard withold Sanger-based assemblers, only a small fraction of NGS-based assemblers provide such information( i.e. , Velvet, Ray, Sutta). Another relevant problem, deeply connected to the first, is the fact thatfeatures were computed by amosvalidate . Such features are commonly available for Sanger reads, clearlycharacterized by widely-varying insert-size distributions and expected coverages.Results summarized in this paper clearly show that FRC bam is able to effectively detect mis-assembliesand that it is able to rank assembler performances. The tool achieves high sensitivity and high specificitythus demonstrating that the implemented features are able to capture the large majority of the problems.Currently 9 features are computed using reads from paired-end libraries, while other 5 are computedusing reads from a mate-pair library. FRCurve is computed using all of them, however the user isfree to concentrate only on a subset of them (PCA can be used as shown in [3] to study features,see Supplementary material). New forensics features can be easily added to the program in order tohighlight new problematic regions: small indels can be identified using reads aligned with gaps ( i.e. , readsaligned with Smith-Waterman-like algorithm), problems in reconstructing gene space can be identifiedusing RNA-seq reads, physical-maps or long single-molecule-sequences can be used to compute features,highlighting scaffolders’ performance.Mapping reads back to the assembly provides only a rough approximation of the layout generation,especially in presence of repeat-structures: in such cases, reads that belong to correctly (or incorrectly)reconstructed duplicated regions can only be mapped randomly on one of the possible occurrences, thus,1jeopardizing the hope of obtaining a correct layout. FRCurve’s ability to detect mis-assemblies is clearlylimited by the presence of non-uniquely aligning reads ( i.e. , reads aligning optimally in two or morepositions). Thus, as the repetitive structures in a genome increase, which complicates the assemblyproblem, so does the difficulty in providing valid assembly evaluation. As the read-lengths increaseor mate-pairs of different lengths become feasible, not only does the assembly problem become moretractable, but also new features enable better identification of problematic regions.Despite the severe limitations imposed by the strategy of approximating read layout with read align-ment, the present trend suggests that assemblers may continue to avoid producing layout files. Thus, itis believed that FRC bam and, more in general, forensics features, will need to be computed by mappingreads back to the assembled sequence. The approach to approximate the layout by mapping reads backto the assembly has several advantages: ( i ) possibility to scale to any genome size (FRC bam is currentlybeing used to evaluate Spruce genome assembly, which will produce a reference genome of length 20Gbp); ( ii ) possibility to compute new forensics features; ( iii ) study relationships among features in amore uniform way.Thanks to the feature-by-feature analysis, the FRCurve is often able to express and explain the currentlimitations of different assemblers. In many situations it is straight-forward to rank the assemblers simplyby inspecting the FRC curves. Even when the scenario is unclear, FRCurve is still useful to highlightadvantages and disadvantages of one assembler over the other ( e.g. , an assembler that presents goodlong range connectivity but makes many mistakes in the small contigs, versus an assembler that has lowconnectivity but does not present local mis-assemblies). It is important to recall that, currently, none ofthe standard de novo evaluation metrics is able to capture these situations in the absence of a referencesequence.We believe that features-based analysis will guide efforts aimed at de novo assembly evaluation and de novo assembler design. Our results clearly show that FRCurve can easily separate the best assembliesfrom the worst ones. By comparing feature-specific curves one can evaluate strong and weak pointsof each assembler and choose the system that best fits one’s objective. It is hoped that, in future,assembler-developers will be guided by the features-based analysis to improve these tools — at the coreof the current genomic revolution. Software and Data Availability
The sequencing data used in this study is publicly available on the GAGE website and on the Assem-blathon website (details are available in Supplementary material). FRC bam source code can be down-loaded from https://github.com/vezzi/FRC align.git
Supplementary Material
All supplementary material is available at: ∼ vezzi/publications/supplementary.pdf and http://cs.nyu.edu/mishra/PUBLICATIONS/12.supplementaryFRC.pdf Acknowledgments
We would like to thank all the Spruce Assembly Project, in particular Prof. Lars Arvestad, Bj¨orn Nystedtand Nathaniel Street for their constant feedback and advice. We also wish to thank Mike Schatz of CSHL,Mihai Pop, XXX for many useful comments on an earlier draft of the paper. Moreover, FV would like2to thank Knut and Alice Wallenberg Foundation for their support. The research reported in this paperwas partially supported by an NSF CDI grant.
References
1. Nagarajan N, Pop M (2009) Parametric complexity of sequence assembly: Theory and applicationsto next generation sequencing. Journal of Computational Biology 16: 897–908.2. Miller J, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data.Genomics : 1–13.3. Vezzi F, Narzisi G, Mishra B (2012) Feature-by-Feature Evaluating De Novo Sequence Assembly.PLoS ONE 7: e31002.4. Semple CAM (2003) Assembling a View of the Human Genome, John Wiley & Sons, Ltd. pp.93–117. doi:10.1002/0470867302.ch5. URL http://dx.doi.org/10.1002/0470867302.ch5 .5. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial sequencing andanalysis of the human genome. Nature 409: 860–921.6. Alkan C, Sajjadian S, Eichler E (2010) Limitations of next-generation genome sequence assembly.Nature methods 8: 61–65.7. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform.Bioinformatics 26: 589.8. Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: finding the elusive mis-assembly. Genome biology 9: R55.9. Narzisi G, Mishra B (2011) Comparing De Novo Genome Assembly: The Long and Short of It.PLoS ONE 6: e19175.10. Earl DA, Bradnam K, St John J, Darling A, Lin D, et al. (2011) Assemblathon 1: A competitiveassessment of de novo short read assembly methods. Genome Research .11. Salzberg SL, Phillippy aM, Zimin aV, Puiu D, Magoc T, et al. (2011) GAGE: A critical evaluationof genome assemblies and assembly algorithms. Genome Research .12. Narzisi G, Mishra B (2010) Scoring-and-Unfolding Trimmed Tree Assembler: Concepts, Constructsand Comparisons. Bioinformatics (Oxford, England) 27: 153–160.13. Zimin AV, Smith DR, Sutton G, Yorke Ja (2008) Assembly reconciliation. Bioinformatics (Oxford,England) 24: 42–5.14. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.Bioinformatics (Oxford, England) 25: 1754–60.15. Zerbino D, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijngraphs. Genome research 18: 821–829.16. Reinhardt J, Baltrus D, Nishimura M, WR (2008) Efficient de novo assembly of bacterial genomesusing low coverage short read sequencing. Genome .17. Brody H, Griffith J, Cuticchia AJ, Arnold J, Timberlake WE (1991) Chromosome-specific recom-binant dna libraries from the fungus aspergillus nidulans. Nucleic Acids Res 19: 3105–3109.318. Anantharaman TS, Mishra B, Schwartz DC (1997) Genomics via optical mapping iii: Contiginggenomic dna and variations (extended abstract). AAAI Press, pp. 18–27.19. Anantharaman TS, Mishra B, Schwartz DC (1997) Genomics via optical mapping ii: Orderedrestriction maps. Journal of Computational Biology 4: 91-118.
Figure Legends a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL
ABySSAllpaths-LGBambus2MSR-CASGASOAPdenovoVelvet (a) Staphylococcus aureus: GAGE entries ,
000 1 ,
100 1 ,
200 1 ,
300 1 ,
400 1 , a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL
ABySSAllpaths-LGBambus2CABOGMSR-CASGASOAPdenovoVelvet (b) Rhodobacter: GAGE entries . . . . . . . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL
ABySSAllpaths-LGBambus2CABOGMSR-CASGASOAPdenovoVelvet (c) Human chromosome 14: GAGE entries . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL
BROADWTSI-SBGIDOEJGICSHLCRACSEBIBCCGSCIoBUGARHULWTSI-PDCSISUIRISAASTRUCSFGACWTCIUoC (d) Assemblathon1 entries
Figure 1.
FRCurve computed on three GAGE datasets and on Assemblathon 1 entries.4
NC_010079gi|161510924|ref|NC_010063.1|gi|225631039|ref|NC_012417.1|0 200000 400000 600000 800000 1000000 1200000 1400000
QRY (a) Staphylococcus aureus vs Allpaths-LG NC_010079gi|161510924|ref|NC_010063.1|gi|225631039|ref|NC_012417.1|0 500000 1000000 1500000 2000000
QRY (b) Staphylococcus aureus vs MSR-CA Figure 2. dotPlots for Staphylococcus: MSR-CA and Allpaths-LG longest contigs have been alignedagainst the reference genome.5
Figure 3.
Dotplot validation of the longest scaffold produced by CABOG on Rhodobacter dataset.The green line represents the Features identified by FRC bam . . . . . . . . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL
BCM-HGSCRayCurtainGAMPhusionmeraculoussgaSymbioseABySSCRACSSOAPdenovoPRICE (a) Assemblathon 2 FRCurve: all features . . . . . . . . . . . . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space HIGH SPAN PE
BCM-HGSCRayCurtainGAMPhusionmeraculoussgaSymbioseABySSCRACSSOAPdenovoPRICE (b) Assemblathon 2 FRCurve: High spanning PE feature
Figure 4.
FRCurve computed on Assemblathon 2 entries. Figure 4(a) shows FRCurves for all thefeatures, while Figure 4(b) shows the FRCurves plotted on a single feature6
Tables
Feature Description
LOW_COV_PE low read coverage areas (all aligned reads).
HIGH_COV_PE high read coverage areas (all aligned reads).
LOW_NORM_COV_PE low paired-read coverage areas (only properly aligned pairs).
HIGH_NORM_COV_PE high paired-read coverage areas (only properly aligned pairs).
COMPR_PE low CE-statistics computed on PE-reads.
STRECH_PE high CE-statistics computed on PE-reads.
HIGH_SINGLE_PE high number of PE reads with unmapped pair . HIGH_SPAN_PE high number of PE reads with pair mapped in a different contig/scaffold . HIGH_OUTIE_PE high number of mis-oriented or too distant PE reads . COMPR_MP low CE-statistics computed on MP reads.
STRECH_MP high CE-statistics computed on MP reads.
HIGH_SINGLE_MP high number of MP reads with unmapped pair . HIGH_SPAN_PE high number of MP reads with pair mapped in a different contig/scaffold . HIGH_OUTIE_PE high number of mis-oriented or too distant MP reads . Table 1.
Description of implemented features. assembler Ctg NG50 Chaff Dupl Comp Indels Misjoins Inv Reloc Sens Spec(Kbp) (%) (%) (%)ABySS 246 34 6.66 23.06 1.05 10 6 4 2 99.25 62.70Allpaths-LG Table 2.
Staphilococcus aureus (GAGE) assembly evaluation and features estimation. For eachassembler we report the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short(Chaff) contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly(all the percentages are computed with respect to the real genome length), the number of long ( i.e. , > assembler Ctg NG50 Chaff Dupl Comp Indels Misjoins Inv Reloc Sens Spec(Kbp) (%) (%) (%)ABySS 1701 9 1.59 9.93 0.49 38 24 2 22 98.92 37.26Allpaths-LG
34 3,192 Table 3.
Rhodobacter sphaeroides (GAGE) assembly evaluation and features estimation. For eachassembler we report the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short(Chaff) contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly(all the percentages are computed with respect to the real genome length), the number of long ( i.e. , > assembler Ctg NG50 Chaff Dupl Comp Indels Misjoins Inv Reloc Sens Spec(Kbp) (%) (%) (%)ABySS 51301 2,1 34.78 0.48 0.44 762
22 15 7
225 81,647
150 90 60 92.13 65.38SOAPdenovo 13501 455 3.09 5.68 3.19 3902 1529 537 992 90.59 73.10Velvet 3565 1,190 4.23 0.08 0.53 4172 9525 4023 5502 91.60 67.55
Table 4.
Human chromosome 14 (GAGE) assembly evaluation and features estimation. For eachassembler we report the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short(Chaff) contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly(all the percentages are computed with respect to the real genome length), the number of long ( i.e. , > assembler Ctg NG50 Chaff Dupl Comp Indels Misjoins Sens Spec(Kbp) (%) (%) (%)BROAD 989
197 95.10 96.55DOEJGI 771 9,073 0.03 0.01 0.84 163
Table 5.
Assemblathon 1 assembly evaluation and features estimation. For each assembler/team wereport the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short (Chaff)contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly (all thepercentages are computed with respect to the real genome length), the number of long ( i.e. , >>