[PDF] Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons

Abstract

In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolutionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental "features" of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler/assembly. International competitions such as Assemblathons or GAGE tried to identify the best assembler(s) and their features. Some what circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality fully assembled reference genome sequence. Still worse, reference-guided evaluations are often both difficult to analyze, leading to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, dubbed FRCbam, which is capable of evaluating de novo assemblies from the read-layouts even when no reference exists. We extend the FRCurve approach to cases where lay-out information may have been obscured, as is true in many deBruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a much wider class of assemblers -- thus, identifying higher-quality members of this group, their inter-relations as well as sensitivity to carefully selected features, with or without the support of a reference sequence or layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.

Full PDF

11 Reevaluating Assembly Evaluations with Feature ResponseCurves: GAGE and Assemblathons

Francesco Vezzi , ∗ , Giuseppe Narzisi , Bud Mishra , , School of Computer Science and Communication, KTH Royal Institute of Technology, Science for LifeLaboratory, 17121 Solna, Sweden Cold Spring Harbor Laboratory, Cold Spring Harbor, United States of America Courant Institute of Mathematical Sciences, New York University, New York, United States ofAmerica NYU School of Medicine, New York University, New York, United States of America ∗ E-mail: Corresponding [email protected]

Abstract

In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolu-tionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA sequences/reads. However, the performance of each ofthese tools is contingent on the length and quality of the sequencing data, the structure and complexityof the genome sequence, and the resolution and quality of long-range information. Furthermore, in theabsence of any metric that captures the most fundamental “ features ” of a high-quality assembly, there isno obvious recipe for users to select the most desirable assembler/assembly. This situation has promptedthe scientiﬁc community to rely on crowd-sourcing through international competitions, such as Assem-blathons or GAGE, with the intention of identifying the best assembler(s) and their features. Some whatcircuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on theavailability of a high-quality fully assembled reference genome sequence. Still worse, reference-guidedevaluations are often both diﬃcult to analyze, leading to conclusions that are diﬃcult to interpret. Inthis paper, we circumvent many of these issues by relying upon a tool, dubbed FRC bam , which is ca-pable of evaluating de novo assemblies from the read-layouts even when no reference exists. We extendthe FRCurve approach to cases where lay-out information may have been obscured, as is true in manydeBruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a muchwider class of assemblers – thus, identifying higher-quality members of this group, their inter-relations aswell as sensitivity to carefully selected features , with or without the support of a reference sequence or lay-out for the reads. The paper concludes by reevaluating several recently conducted assembly competitionsand the datasets that have resulted from them.

Introduction

The extraordinary advances in Next Generation Sequencing (NGS) technologies over the last ten yearshave triggered an exponential drop in sequencing cost, thus making it possible to perform whole-genomeshotgun (WGS) sequencing of almost every organism in the biosphere. In particular, recent WGS projectsare distinctive by the way they have facilitated whole genome sequencing at a high coverage ( i.e. , higherthan 50 × ), albeit, composed of relatively short sequences ( i.e. , reads).Despite this impressive progress, recent eﬀorts have underlined the diﬃculties in trading-oﬀ read lengthagainst read coverage. It is now well recognized how the short reads have made the assembly problemsigniﬁcantly harder [1] owing to the complexity involved in resolving ( i.e. span over) long repeats.Nonetheless, this challenge has been confronted recently with sophisticated and novel techniques,embedded in a diverse set of tools all aiming to solve de novo assembly problem. Such tools ( i.e. , assemblers ) are based on the simple assumption that if two reads share a suﬃciently long subsequence a r X i v : . [ q - b i o . GN ] O c t then they are likely to belong to the same location in the genome. In order to represent and eﬃcientlyuse such information for myriads of short reads, assemblers typically rely on compressed graph structures(often de-Bruijn graphs but also string-graphs). Moreover, additional heuristics are employed for errorcorrection and read-culling.More than twenty diﬀerent assemblers have been designed to tame the computational complexity ofassembling NGS reads, with the vast majority of them speciﬁcally targeting Illumina reads. One of themain consequences of this proliferation in software production is the diﬃculty in selecting one assemblerover another, which often makes a Buridan’s ass of a bioinformatics researcher: Their eﬀort spent onselecting the best assembler ( i.e. , the largest haystack for the ass) ultimately diverts them from theirultimate objective of answering biological questions ( i.e. , leading to a confused and starving ass).Adding to the confusion, every new genome presents its own sets of problems, e.g , ploidy, heterozigos-ity, repetitive structures, etc. . The available assemblers usually are able to eﬃciently solve only someof these problems or are speciﬁcally designed for limited datasets ( e.g. , bacterial genomes). A widelyfollowed approach is to use multiple assemblers, run with diﬀerent parameters, producing statistics thatcould point to the best among them. However, no clear way to select the “best” assembler has yetmade itself obvious. As noticed by Miller in [2] all new published assemblers have been compared tothe then-existing tools showing, every time, their better performances on a speciﬁc dataset and on somespeciﬁc metrics. More often than not, only traditional metrics ( i.e. , contiguity-based metrics) are used incomparing assemblers’ performances ( e.g. , number of contigs, NG50, etc. ) – a strategy that suﬀers fromthe drawback of emphasizing only assembly size. Moreover, in [3], NG50 (the most “abused” metric)has been demonstrated to be a bad assembly quality predictor. In contrast, more reliable results can beproduced, when a reference sequence is available, since contigs could be aligned against it in order tojudge the number of errors ( i.e. , reference-based metrics). Unfortunately, currently, no eﬀort is usuallymade in weighting or scoring qualitatively diﬀerent types of errors, thus reducing this approach to asimple error counting without accounting for subtle diﬀerences among the diﬀerent types of errors.More recently, the focus has shifted from seeking just contiguity to assembly precision. An earlierstudy [4] showed that in the published and revised human genome [5] on average 10% of assembledfragments were assigned the wrong orientation and 15% of fragments, placed in a wrong order. Recallthat this draft sequence of the Human Genome [5], which was released in 2001, had taken several largeteams more than ﬁve years to ﬁnish and validate (but only at a genotypic level). With many projectsleft at draft level, NGS technologies have worsened this situation even further. Alkan in [6] criticizedtwo of the majors NGS achievements: the assembly of the Han Chinese and Yoruban individuals [7] bothsequenced with Illumina reads. Alkan identiﬁed 420 Mbp of missing repeat sequences from the Yorubanassembly, and estimated that in both assemblies almost 16% of the genome was missing.Despite these widely discussed and obvious problems, there still persists a lack of standard proceduresand methods to validate and evaluate assemblies. Several projects have been initiated to explore theparameter space of the assembly problem, in particular in the context of short read sequencing [6, 8, 9].Recently, a growing number of studies have aimed at independently evaluating diﬀerent assemblers orassembly pipelines. Assemblathon 1 [10] and

Assemblathon 2 sought to assess assemblers’ performanceson common datasets encouraging a competition among researchers/users and assemblers’ developers. Inits earliest version, the competition was performed on a simulated dataset, leaving open to a criticismof the eﬀectiveness of its genome and read simulators [3]. Assemblathon 1’s entries were evaluated andranked, based on a mixture of contiguity-based and reference-based metrics. The ﬁnal result is a largetable (see Table 3 in [10]) in which some assemblers perform well on some metrics while behaving poorlyon others, thus, leaving its interpretation somewhat equivocal.A similar but independent study, dubbed GAGE, has been designed to critically evaluate and compareassemblers on four diﬀerent large-scale NGS projects [11]. The presence of an already assembled referencesequence for three of the studied genomes allowed the authors to assess assembly quality. One of themain message of this study is that the same assembler can produce utterly diﬀerent qualities of resultson diﬀerent datasets. Moreover, Salzberg and colleagues showed how assemblers’ performance is aﬀectedby data quality: preprocessing used in read correction seems fundamental to improve assemblers’ results.The main conclusion of this study is that there is no universal “assembly recipe” to be used for assemblingnew genomes. An assembler working well on certain genomes may exhibit drastically poorer performancewhen used to assemble even a fairly similar genome. A fundamental criticism against GAGE is how itselects the “best” assembly for each assembler and for each dataset: GAGE’s authors chose the assemblywith the largest NG50, thus building on an extremal statistic which, as mentioned earlier, also happensto be the worst quality predictor [3]. As in Assemblathon 1, GAGE output is presented as a set of tableswith massive amount of—often hard-to-interpret—information.This state of aﬀairs is not completely surprising, given the complexity of assembly evaluation, es-pecially, when all errors cannot be substantially eliminated. For instance, even after six months sinceAssemblathon 2’s competition, an oﬃcial ranking remains undisseminated (except for the one based onNG50).Recently, Narzisi and Mishra in [12] proposed a new metric, Feature Response Curve (FRCurve),capable of capturing the trade-oﬀ between contig contiguity and correctness. FRCurve is based on theprinciple that the assembly precision can be predicted by identifying on each contig a set of suspiciousregions ( i.e. , features ): contigs are then sorted from the longest to the shortest, and for each featurethreshold δ only the longest contigs whose total sum of features is less than δ are used to compute thegenome coverage ( i.e. , a single point in the FRCurve). Such technique has been extensively studiedand evaluated in [3]. Despite its power the main limitation of FRCurve is that it requires the so-called read layout , a standard output of Sanger-based assemblers, but missing in the vast majority of NGSassemblers. Such dependency restricts FRCurve analysis tools to only OLC, overlap-layout-consensusbased assemblers and thus to a limited subset of NGS-based studies.In this paper, we present an enhanced tool, named FRC bam , capable of computing FRCurve fromthe alignment of the reads to the assembled contigs. In particular, we show that this method is able tocorrectly and rigorously evaluate assemblers’ performance and precision, even in the absence of a referencesequence, while using a broad set of metrics, not just those based on assembly contiguity. We begin bydescribing the set of implemented features, and then evaluate our tool on the datasets used in the threemajor assembly evaluations eﬀorts: GAGE, Assemblathon 1 and Assemblathon 2. Materials and Methods

Almost always, de novo assembly is carried out using more than one library. In the Illumina scenario wetypically have at least two libraries: one paired-end library (PE), and one mated-pair library (MP). Theformer provides paired reads in the standard orientation ( → ← ) with insert size that can vary between150 bp (overlapping fragments) and 1000 bp (standard PE). The latter yields pairs of sequences in theopposite direction ( ← → ) and the insert size is much longer (usually in the range between 3 and 10 Kbp).Due to the diﬀerent cost of the two protocols a typical sequencing project consists of one high coveragePE library and one low coverage MP library. The main advantage of MP reads is to improve contiguitythrough scaﬀolding and gap-ﬁlling procedures. However, the MP library is intrinsically more diﬃcult toobtain than standard PE libraries and are usually aﬀected by redundancy (PCR duplicates) and unevengenome representation.After PE reads and MP reads are aligned against the assembly itself, the ordered and indexed BAMﬁles will be the input of FRC bam . FRC bam needs at least one PE library and, if available, one MP library.The user needs to provide a rough estimation of the insert size and of the standard deviation for bothlibraries and an estimation of the genome length. Read coverage and spanning coverage are computeddirectly from the BAM ﬁles.Several features are computed in order to identify problems related to read coverage, mate pair hap-piness [8], and compression/expansion events ( i.e. , CE-statistics) [13]. As a consequence of their diﬀerentnature, PE reads and MP reads are used to compute two diﬀerent sets of features. The former is usedto compute the following features:

LOW_COV_PE , HIGH_COV_PE , LOW_NORM_COV_PE , HIGH_NORM_COV_PE , COMPR_PE , STRECH_PE , HIGH_SINGLE_PE , HIGH_SPAN_PE , and

HIGH_OUTIE_PE . The latter library is usedto compute only a subset of the features, similar to the ones in the previous set:

COMPR_MP , STRECH_MP , HIGH_SINGLE_MP , HIGH_SPAN_MP , and

HIGH_OUTIE_MP . The main diﬀerence is due to the fact that MPreads usually provide a low read coverage ( i.e. vertical) but produce a high spanning coverage ( i.e. hor-izontal). Therefore MP reads are best used to compute features related to long range information (seeTable 1 and Supplementary material for a detailed description of features).FRC bam output consists of several ﬁles: ( a ) the FRCurve itself (to be plotted), ( b ) the FRCurves foreach individual feature, and ﬁnally, ( c ) a position-by-position description of the feature (in GFF format).This last ﬁle holds for each contig the identiﬁed features, together with the start and end points. Datasets

For comparative analysis of NGS assemblers, both GAGE and Assemblathon studies oﬀer state-of-the-art datasets, which could also be re-purposed to evaluate reliability of the new FRC bam . These datasetswere of particular interest to us for several reasons, falling into three categories: ( i ) datasets consistof state-of-the-art sequences, with reads often belonging to several paired-end and mate-pairs libraries;( ii ) availability of already “optimized” assemblies; ( iii ) presence of a reference sequence for most of thesequenced organism.The ﬁrst category allowed us to test FRC bam against state-of-the-art datasets and to take advantageof diﬀerent insert-types. The second category enabled us to use assemblies that may be considered as the“best” achievable, since they were obtained by de novo assembly experts ( i.e. , GAGE) or by the sameassemblers’ developers ( i.e. , Assemblathon). Speciﬁcally, the availability of a reference sequence, allowus to measure assemblies’ correctness, thus also demonstrating how FRC bam and the computed featuresare able to eﬀectively gauge assembly accuracies and to identify suspicious regions ( i.e. , mis-assemblies).In total, we tested FRC bam on ﬁve datasets: Staphilococcus aureus , Rhodobacter sphaerodis , andHuman chromosome 14 from GAGE, data of simulated genomes from Assemblathon 1 competition, and

Boa constrictor ( i.e. , Snake) from Assemblathon 2 competition. All ﬁve datasets are composed of highcoverage ( i.e. , all exceeding 40 × ) Illumina paired-end and mate-pair reads libraries. S. aureus has beenassembled with 7 diﬀerent assemblers (see Table 2),

R. sphaerodis and Human chromosome 14 (hereafterHc14) have been assembled with 8 diﬀerent assemblers (see Tables 3 and 4). Assemblathon 1 andAssemblathon 2 comprise 59 and 12 entries respectively. The large number of Assemblathon 1 entries issimply a consequence of the rule to permit multiple submissions: we decided to download only the bestentry from each team, as determined by the Assemblathon 1 ranking (refer to [10] for more details), fora total of 17 entries. Summarizing, we tested FRC bam on ﬁve extremely diﬀerent datasets for a total of43 assemblies.For each dataset we selected one paired-end library and one mate-pair library (see Supplementarymaterial for more details). These two libraries were then aligned against the available assemblies us-ing rNA [3]. We aligned reads using also BWA [14] without detecting any noticeable diﬀerence (seeSupplementary material).Using libraries with diﬀerent insert sizes ( i.e. , paired-end and mate-pair reads) enabled us to identifydiﬀerent features types. On the one hand, paired-end reads, characterized by a short insert size ( i.e. ,usually less than 600 bp) are able to highlight local mis-assemblies and relatively small insertions/deletionsevents. On the other hand, mate-pairs, characterized by a larger insert size ( i.e. , usually more than 2Kbp) are able to highlight larger insertion/deletion events and larger mis-assemblies ( e.g. , scaﬀoldingerrors).

Results

Figure 1 shows FRCurves for the three GAGE genomes (

S. aureus

Figure 1(a),

R. spheroides

Figure 3,and Hc14 Figure 1(c)) and for Assemblathon 1 entries (Figure 1(d)). For each of the analyzed assemblieswe aligned contigs against the reference genome. To accomplish this task we employed the scripts availableon GAGE website [11]. Assembly statistics are reported in Tables 2, 3, 4, and 5.The four tables (Tables 2, 3, 4, and 5) report for each assembly/assembler the number of con-tigs/scaﬀolds produced (Ctg), the NG50, the percentage of short ( i.e. less than 200 bp) contigs, thepercentage of duplicated (Dupl) and compressed (Comp) regions in the assembly (all the percentages arecomputed with respect to the real genome length), the number of long ( i.e. , > dnadiﬀ [8]we could identify regions of real mis-assemblies, thus enabling us to compute sensitivity and speciﬁcityof our features. Note that sensitivity is deﬁned as the ratio between true positives ( i.e. , positions markedas mis-assembled by dnadiﬀ and labelled by one or more features), and its sum with false negatives ( i.e. ,positions marked as mis-assembled by dnadiﬀ but not labelled by any feature). Speciﬁcity, instead, isthe ratio between true negatives ( i.e. , positions not marked as mis-assembled by dnadiﬀ and not labelledby any feature) and its sum with false positives ( i.e. , positions not marked as mis-assembled by dnadiﬀ but labelled by one or more features). The ﬁrst measure enables FRCurve to identify problematic areas,while the latter measure distinguishes non -problematic from problematic regions ( e.g. , if a feature marksall position in an assembly the sensitivity will be 1, however the speciﬁcity is likely to be close to 0). GAGE

Figure 1(a) and Table 2 show the FRCurve and the reference guided validation of

S. aureus

GAGE’sdataset respectively. From Figure 1(a) MSR-CA and Allpaths-LG appear to be the best performingassemblers on such datasets ( i.e. , the sharpest curves). These two assemblers are closely followed bySOAPdenovo, Velvet, and Bambus2, while SGA and ABySS clearly show bad performance. Both sen-sitivity and speciﬁcity of reported features are high (last two columns of Table 2), thus demonstratingthat FRC bam (and therefore our features) is able to correctly identify suspicious regions. Speciﬁcityis not particularly high only for ABySS and SGA. However, in these two assemblies the percentage ofmis-assembled sequences identiﬁed by dnadiﬀ are 20% and 8%, respectively, suggesting a high numberof problematic regions close to the real mis-assembly sequences.Some remarks are warranted on the stepwise shape of some curves ( e.g. , MSR-CA, Allpaths-LG andBambus2). Such a shape indicates the presence of contigs with a large number of features that interruptsa smooth growth of the curve, which is particularly discernible when the number of contigs is low. Asan example, consider the longest MSR-CA contig containing almost half of the features identiﬁed inthe entire assembly. The high sensitivity and speciﬁcity reported in Table 2 show that these featuresrepresent truly problematic regions. Let us focus on Allpaths-LG and MSR-CA: in Figure 2 we present thealignment of the longest scaﬀold produced by Allpaths-LG and MSR-CA against the reference genome.From Figure 2(b) it is clear that the stepwise shape of MSR-CA’s FRCurve is a consequence of wrongchoices made by the assembler. The situation is diﬀerent in the Allpaths-LG case: Figure 2(a) shows acorrectly reconstructed scaﬀold, therefore there is apparently no reason to justify the stepwise curve ofAllpaths-LG. Puzzled by this anomaly, we plotted the FRCurve for each single feature (see Supplementarymaterial). With this analysis, we discovered that Allpaths-LG has the best curve in the majority of thecases. However, there are two exceptions:

STRECH_MP and

COMPR_MP features, which are representativeof compression or expansion events. Areas characterized by these features coincide with the circles inthe dotplot (see Figure 2(a)): these areas involve small mis-joins ( i.e. , less than 50 bases) or scaﬀoldjunctions ( i.e. , sequences of Ns). A likely explanation is that such small mis-joins are able to “attract”reads that are responsible for the features. Moreover,

STRECH_MP and

COMPR MP features depend on CEstatistics [13] and therefore on the choice of two thresholds, often estimated sub-optimally — note that,despite the availability of a reference sequence, these thresholds were estimated without it. MSR-CA isthe assembly characterized by the largest number of areas composed of large numbers of mis-orientedmate/paired reads ( i.e. , HIGH_OUTIE_PE and

HIGH_OUTIE_MP ), as a consequence of the large inversionsand translocations present in the ﬁrst scaﬀold. Other hints about MSR-CA’s problems come from theFRCurve obtained from the contigs (see Supplementary material): MSR-CA’s FRCurve is not as goodas those of Allpaths-LG, SOAPdenovo and Bambus2.This situation demonstrates that assembly evaluation is extremely diﬃcult. With the help of areference sequence it is clear that MSR-CA suﬀers from a large number of errors (see Table 2). However, inits absence, many users might have chosen MSR-CA over others, since it seemed to be able to reconstructalmost the whole genome with a single scaﬀold. FRCurve, without the use of a reference, was ableto raise doubts about MSR-CA ( i.e. , the only assembler with a high number of

HIGH_OUTIE_PE and

HIGH_OUTIE_MP features), thus suggesting a more careful manual validation on Allpaths-LG.According to the FRCurve analysis, SGA (together with ABySS) is one of the worst performingassemblers. Although GAGE analysis concludes that SGA introduces relatively fewer errors, it is also themost fragmented one, consisting of 456 scaﬀolds (and 1252 contigs). This kind of assemblies, despite itslow error-rate, tends to accumulate features related to copy number variation problems ( e.g. , LOW_COV_PE )and features like

HIGH_SPAN_MP suggesting problems in the scaﬀolding ( i.e. , either errors in the scaﬀoldingor a failure in establishing contig connections).Similar analyses can be carried out for

R. sphaeroides and Hc14 datasets whose FRCurves are repre-sented in Figures 1(b) and 1(c).In

R. sphaeroides dataset Allpaths-LG and MSR-CA again appear to be the two best performingassemblers, though SOAPdenovo, Velvet, and Bambus2 are not too far behind. The longest Allpaths-LG scaﬀold practically reconstructs the longest Rhodobacter chromosome: such scaﬀold contains only100 features most of them suggesting the presence of regions aﬀected by low paired read coverage ( i.e. , LOW_NORM_COV_PE and

LOW_COV_PE ). Such features aﬀect all others assemblers similarly. From FRCurveanalysis one may conclude that Allpaths-LG is the best performing tool. The alignments of Allpaths-LGassembly against the reference further conﬁrm this conclusion (see Supplementary material).Bambus2 is characterized by a long (correct) scaﬀold that contains almost one third of its fea-tures. This situation is a consequence of regions composed of a large number of singleton reads ( e.g. , HIGH_SINGLE_MP ) and of areas suggesting the presence of compression events ( e.g. , COMPR_MP ). Similarlyto the analysis of the

S. aureus , these features seem to coincide with small gaps (as the alignment of thelongest Bambus2 scaﬀold against the reference sequence shows, see Supplementary material).From Figure 1(b) CABOG appears not to be a very well performing assembler. Such situation isconﬁrmed by Figure 3 that shows the dotplot for CABOG’s longest scaﬀolds. The green columns at thebottom of the dotplot indicate the position where one or more features have been found by FRC bam .This plot shows how features are able to highlight problematic regions in the assembly, as the majorityof them coincide with the mis-assemblies.In the Hc14 case (see Figure 1(c)) Allpaths-LG and CABOG are clearly the best two assemblers.Allpaths-LG is the only assembler able to assemble almost all the sequences in a single scaﬀold contain-ing, practically, all the features. The total number of features identiﬁed on this long scaﬀold is lowerthan the total amount of features identiﬁed in the 400 longest CABOG scaﬀolds. When we consider theFRCurves for each individual feature (see Supplementary material), we notice that Allpaths-LG longestcontig is characterized by a large number of features suggesting coverage problems ( e.g , LOW_NORM_COV_PE , LOW_COV_PE , HIGH_NORM_COV_PE , and

HIGH_COV_PE features) and mated/paired read orientation prob-lems ( e.g. , HIGH_OUTIE_PE , and

HIGH_OUTIE_MP features). As far as the coverage features are concerned,Allpaths-LG has almost always a lower number of such features than the other assemblers. Moreover,

LOW_NORM_COV_PE feature is often the consequence of Allpaths-LG’s ability to correctly resolve repeatedregions (pairs are not correctly aligned as a consequence of a repeat, see Supplementary material). Lessstraightforward is the explanation for the large number of features suggesting the presence of a largenumber of mis-oriented pairs (in this case Allpaths-LG being one of the worst assemblers). Such featuresare indicative of inversions and insertions events, although the dotplot shows an almost contiguous scaf-fold that reconstructs the Chromosome 14 without any particular problem (see Supplementary material).After a closer inspection, we discovered that such long scaﬀold is aﬀected by a large number of smallmis-joins as suggested by the circles in the main dotplot diagonal (see Supplementary material). Wetested 10 diﬀerent areas subject to such mis-joins and in all cases we discovered either a scaﬀold joint istoo large or a scaﬀold joint has a short chimeric sequence in the middle, thus explaining the presence ofa feature. The presence of these small mis-joins has been reported also in GAGE analysis: in the Hc14dataset, the NG50 was close to 81 Mbp while the corrected-NG50 was 20 times shorter (the corrected-NG50 is the NG50 computed after breaking contigs at mis-assembled positions identiﬁed by the referencesequence). The low number of compression/expansion features ( i.e. , CE statistics), as well as the lownumber of high-spanning and high-single reads related features in Allpaths-LG assembly (see FRCurvesplots in Supplementary material) suggests that Allpaths-LG is able to return an assembly that is highlyand correctly connected. However, the relatively large number of paired-end related features suggests thepresence of small local mis-assemblies. On the other hand, CABOG produced a more fragmented assem-bly characterized by a small number of features. CABOG’s most frequent features ( i.e. , HIGH_SPAN_PE and

HIGH_SPAN_MP ) suggest a systematic failure during the scaﬀolding phase in correctly merging contigsand inferring their order.From FRCurve analysis alone, it is much harder to decide between the top two assemblers: Allpaths-LG and CABOG, though when the reference sequence is available, it is evident that Allpaths-LG suﬀersless from errors than CABOG (see Table 4). When considering only contigs (see Supplementary material)CABOG and Allpaths-LG still outperform other assemblers, as clearly proved by GAGE analysis (longestNG50).With almost 30,000 features MSR-CA is the third ranking assembler as determined by the FRCurveanalysis. MSR-CA is closely followed by SOAPdenovo and SGA. It is again diﬃcult to fully ascertainsuch ranking, and even the reference guided validation in Table 4 does not lead to a clear and conclusiveopinion. The majority of SGA’s features are a consequence of the highly fragmented assembly (see

HIGH_SINGLE_MP

FRCurve in Supplementary material). However the small number of errors (see Table4) demonstrates that the ﬁnal sequences are correct. SOAPdenovo is slightly better than MSR-CA asfar as the number of errors is concerned, notwithstanding the fact that SOAPdenovo is more fragmentedthan MSR-CA. SOAPdenovo is particularly aﬀected by the presence of mis-oriented paired reads ( i.e. , HIGH_OUTIE_PE feature).In all the three GAGE datasets the sensitivity of the FRC bam is almost always higher than 90%(CABOG is an exception, but it must be noted that the percentage of mis-assembled sequences is less than1.4% of the genome length). Speciﬁcity is in general high, with the exception of assemblies characterizedby high errors rates ( e.g. , more than 40% of Velvet assembly is marked as suspicious by dnadiﬀ on Hc14).

Assemblathon 1

Assemblathon 1 dataset diﬀers from that of GAGE mainly in two ways: it is much larger and it is obtainedsolely by simulation. Figure 1(d) and Table 5 summarize the analysis performed on such datasets. It is ofparticular interest to compare FRCurve assembly evaluation with Assemblathon 1 paper evaluation [10].The order of the entries in Table 5 and of the legend in Figures 1(d) follows the Assemblathon 1 ranking.Despite the presence of some outliers, the FRCurve analysis is close to the ranking obtained by Earl et al. . BGI, WTSI-S, DOEGI, and CSHL were found by the FRCurve analysis to be better performingassemblers. They, together with Broad Institute’s ( i.e. , Allpaths-LG), were the ﬁve best assemblers ac-cording to Assemblathon 1 ranking. A similar analysis could determine the worst performing assemblers.CIUoC, GACWT, UCSF, ASTR, and IRISA are clearly characterized by undesirable FRCurves (CIUoC’slong contigs contain few errors, even though the assembly contains only a fraction of the whole genomeand small contigs contain many features).There are some clear diﬀerences, for example Broad’s Allpaths-LG, the best assembler in Assem-blathon 1 ranking is clearly among the best ones also in our FRCurve-based analysis, but has a highnumber of features suggesting problems with paired reads ( i.e. , LOW_NORM_COV_PE and

HIGH_SPAN_PE features). We discovered that these two features are highly correlated: in all the analyzed cases wediscovered the presence of a small contig perfectly (or almost perfectly) aligning against a larger contig,probably the result of a wrong copy number estimation or of an unresolved allele splitting event. Thisobservation is consistent with the analysis by Eearl et al. , as, for instance, Broad’s entry ranks 11 th forcopy number statistics.Another clear diﬀerence is CRACS, the 6 th ranking assembler in Assemblathon 1 evaluation, but anaverage performing assembler according to FRCurve analysis. The poor performance of this assembler isobserved in a series of long contigs all exhibiting an extremely high coverage ( i.e. , HIGH_COV_PE ). Thisis clearly reﬂected also in the ranking given by Assemblathon 1: CRACS has clear problems in inferringcopy number variation (12 th ranking tool) and it reconstructs only 96% of the genome (14 th rankingtool). FRCurve analysis suggests two possible solutions: either discard contigs strongly aﬀected by thisfeature, or have CRACS developers reimplement an improved copy number variation estimation.The last two assemblers we considered are RHUL and IoBUGA. Also in this case, these assemblers haveFRCurves comparable to the best assemblers, but have been ranked below the median in Assemblathon-1’sevaluation. According to Assemblathon-1’s evaluation, RHUL has an acceptable number of substitutions(5 th ranking tool); it is able to assemble sequences in the right copy number (5 th ranking tool); and itis able to reconstruct (cover) the large part of the reference (4 th ranking tool). However, it lacks goodconnectivity (13 th ranking tool). FRCurve shows this assembler to contain most of its features in thelongest scaﬀolds, while the short ones contain a small number of features. Note that the longest of RHUL’s scaﬀolds generates a curve similar to ASTR’s. IoBUGA oﬀers a similar story. Assemblathon-1’s rankingis diﬃcult to interpret (15 th ranking tool for substitutions and gene coverage but 3 rd ranking tool forcopy number variation). This situation reemphasizes that reference guided validations are extremelydiﬃcult to interpret, especially when a tool exhibits contradicting performance. It should also be pointedout that IoBuga has the lowest sensitivity (see Table 5). It is clear that new features may be addedin order to improve the eﬀectiveness of FRC bam and FRCurve analysis. In this case, the availability ofRNA-seq data may allow design of new features, capable of capturing assemblers’ ability to reconstructgene expressions, splicing variants and intron-exon boundaries. Assemblathon 2

As shown earlier, the GAGE datasets were suﬃcient for testing the performance of FRC bam using onlyrelatively small datasets. But with access to reference sequences, some of the limitations of the analysisbecame evident: only

S. aureus and

R. spaeroides are realistic datasets, while Hc14 has been partiallysimulated (reads have been aligned and extracted, see [11] for more details). Moreover,

S. aureus and

R. spaeroides datasets are extremely small in size and, to some extent, represent fairly easy-to-assemblegenomes ( i.e. , no heterozygosity or high ploidy). With access to Assemblathon 1 data, we further testedthe FRC bam against a larger dataset that was previously analyzed and ranked. The main limitation ofthis dataset stemmed from the use of simulated reads, which often diverged from any reasonable modelof reality.In order to show the applicability of our method to larger sequencing projects we tested the FRC bam on all Assemblathon 2 entries for the Snake dataset (

Boa constrictor ). Results are shown in Figure 4.Surprisingly, when all features are considered all together, their FRCurves coincide closely with each other(see Figure 4(a)) suggesting that Assemblathon 2 participants, or the tools used by them, are convergingto common results. We can identify two teams (assemblers) that are doing better than the others: SGAand Meraculous. There is a dense conglomerate of similarly behaving assemblers consisting of ABySS,Phusion, SOAPdenovo, CRACS, and Ray. Other assemblers appear less promising, though, except forthe sole example of PRICE, none of them show unacceptably bad performance. The good performanceof CRACS on this dataset brings to mind how drastically diﬀerently the same assembler could behave ondiﬀerent datasets.Results are diﬀerent if we concentrate on one feature at a time (see Figure 4(b) and Supplementarymaterial). As an example, by inspecting the plot for the

HIGH_SPAN_PE feature, we observe that GAMoutperforms all the other assemblers. Meraculous and SGA show good performance too, together withCurtain, Symbiose, and BCM-HGSC.

HIGH_SPAN_PE feature indicates presence of mis-joins, as oftenpresumably close-by pairs are found in diﬀerent contigs/scaﬀolds.Particularly interesting is the FRCurve plot describing the presence of areas composed mainly ofsingle ended reads ( i.e , HIGH_SINGLE_MP feature, see Supplementary material). All assemblers are stronglyaﬀected by this feature demonstrating a general failure of all tools. A likely explanation is in a systematicfailure in correctly assembling heterozygous loci , which generates holes in the assemblies, thus confoundingthe assemblers attempting to place both reads of mate-pairs. Note that this behavior is not present inthe

HIGH_SINGLE_PE features. A feature like

HIGH_SINGLE_MP is clearly not informative in this datasetand may be ignored without aﬀecting the analysis.

Discussion

Limitations of de novo assembly evaluation

The rapidly growing set of new assemblers aims to address the need for assembly tools capable of handlingthe vast amount of data produced by NGS ( e.g.

Illumina) sequencers. This growth in data and tools,however, has led to another unmet need: a rigorous comparative study of these assemblers, which so farhas only been carried out in a rather na¨ıve way. Developers have focused more on performance ( e.g. ,RAM and CPU time) and connectivity ( e.g. , contig number and NG50) rather than on correctness.A commonly employed approach, currently being used to validate and gauge assemblies, is basedon a plethora of standard validation metrics . We can identify four main groups: length-base statistics,reference-based statistics, simulation-based statistics, and long-range-information (LRI) based statistics.Length-based statistics take into account only the size of the assembler output. These statisticscomprise mean contig length, maximum contig length, and NG50. NG50, in principle, gives an ideaof assemblies’ connectivity level. All length-based statistics are not linked to assembly correctness andemphasize only length: an assembler that eagerly merges together contigs can produce assemblies char-acterized by a large NG50 and by few long contigs. However, these long contigs are of no use if theycontain too many misassemblies. NG50 has been shown in [3] to be a bad quality predictor. Never-theless, length-based statistics are the basic, and some times the only, method used to judge assemblersperformances, especially when the assembly tools are new [15, 16].Assembly analysis would trivialize if the genome to be assembled was already available, which wouldmake it possible to compare assemblers using only the reference-based statistics. The strategy would beto resequence an organism with an already available fully ﬁnished whole genome reference sequence. Thisapproach would enable comparing assemblers from the computed real number of errors. The underlyingpremise is that good performances of an assembler on one dataset should reﬂect behavior on a widerrange of datasets. However, studies like GAGE has shown that the same assembler can produce utterlydiﬀerent results on diﬀerent genomes and diﬀerent datasets — thus dashing any hope of generalizing theperformance of a tool on the basis of a single dataset. Moreover, reference-metrics are in general diﬃcultto interpret or, at least, are open to several interpretations: as an example reference-based metrics havebeen used both to demonstrate the high quality assembly of two human individuals in [7] as well as todemonstrate the opposite (their poor quality) in [6].Simulation-based statistics face even more extreme hurdles: reads are simulated from a referencesequence and subsequently assembled. Vezzi et al. showed in [3] that simulated reads are likely to produceunrealistic contigs that cannot be used to judge assemblers’ performance. Despite these shortcomings,0competitions like Assemblathon 1 have continued to use a simulation-based approach.A more reasonable way to assess assembly correctness consists in the use of long range information.Second Generation Technologies are able to produce mate-pairs , that are pairs of reads at a mean distanceof 2 − i.e. , mate-pair happiness). Other two commonly used LRI-methods are physical maps [17]and optical maps [18, 19]. Both rely on the relative locations of diﬀerent genes and other DNA sequencesof interest in the genome. Third Generation Sequencing Technologies (also known as Single MoleculeSequencing Technologies) and dilution-based sub-genomic sampling can also be used in the near futureto estimate assembly correctness. The main drawback of LRI statistics is the fact that they requirethe production of new and often expensive data. Moreover, apart from the simple counting, it remainsunclear how such information should be used to rank diﬀerent assemblies that currently exist. FRCurve

The aim of this work is to present a new simple tool able to accurately evaluate assemblies and as-semblers’ performance even in the absence of a reference sequence. Features have been ﬁrst introducedin [8] to identify possible mis-assemblies. Narzisi and Mishra [9] used such features to compute the socalled Feature Response Curve (FRCurve). FRCurve is closely connected to the standard receiver op-erating characteristic (ROC) curve: the Feature-Response curve characterizes the sensitivity (coverage)of the sequence assembler output (contigs) as a function of its discrimination threshold (number of fea-tures/errors). Given a set of features, the response (quality) of the assembler output is then analyzed asa function of the maximum number of possible errors (features) allowed in the contigs. More speciﬁcally,for a ﬁxed feature threshold τ , the contigs are sorted by size and, starting from the longest, only thosecontigs are tallied, if their sum of features is less than τ . For this set of contigs, the correspondingapproximate genome coverage is computed, leading to a single point of the Feature-Response curve.Vezzi et al. [3] analyzed Feature space using multivariate techniques ( i.e. , PCA and ICA) in order tostudy features’ interactions and to use these to select the most important ones. Such study, however,highlighted one of the main weak points of FRCurve: the need of a layout ﬁle, that is, a ﬁle describing thepositions and orientations of each read (and therefore, each pair). While this ﬁle had been standard withold Sanger-based assemblers, only a small fraction of NGS-based assemblers provide such information( i.e. , Velvet, Ray, Sutta). Another relevant problem, deeply connected to the ﬁrst, is the fact thatfeatures were computed by amosvalidate . Such features are commonly available for Sanger reads, clearlycharacterized by widely-varying insert-size distributions and expected coverages.Results summarized in this paper clearly show that FRC bam is able to eﬀectively detect mis-assembliesand that it is able to rank assembler performances. The tool achieves high sensitivity and high speciﬁcitythus demonstrating that the implemented features are able to capture the large majority of the problems.Currently 9 features are computed using reads from paired-end libraries, while other 5 are computedusing reads from a mate-pair library. FRCurve is computed using all of them, however the user isfree to concentrate only on a subset of them (PCA can be used as shown in [3] to study features,see Supplementary material). New forensics features can be easily added to the program in order tohighlight new problematic regions: small indels can be identiﬁed using reads aligned with gaps ( i.e. , readsaligned with Smith-Waterman-like algorithm), problems in reconstructing gene space can be identiﬁedusing RNA-seq reads, physical-maps or long single-molecule-sequences can be used to compute features,highlighting scaﬀolders’ performance.Mapping reads back to the assembly provides only a rough approximation of the layout generation,especially in presence of repeat-structures: in such cases, reads that belong to correctly (or incorrectly)reconstructed duplicated regions can only be mapped randomly on one of the possible occurrences, thus,1jeopardizing the hope of obtaining a correct layout. FRCurve’s ability to detect mis-assemblies is clearlylimited by the presence of non-uniquely aligning reads ( i.e. , reads aligning optimally in two or morepositions). Thus, as the repetitive structures in a genome increase, which complicates the assemblyproblem, so does the diﬃculty in providing valid assembly evaluation. As the read-lengths increaseor mate-pairs of diﬀerent lengths become feasible, not only does the assembly problem become moretractable, but also new features enable better identiﬁcation of problematic regions.Despite the severe limitations imposed by the strategy of approximating read layout with read align-ment, the present trend suggests that assemblers may continue to avoid producing layout ﬁles. Thus, itis believed that FRC bam and, more in general, forensics features, will need to be computed by mappingreads back to the assembled sequence. The approach to approximate the layout by mapping reads backto the assembly has several advantages: ( i ) possibility to scale to any genome size (FRC bam is currentlybeing used to evaluate Spruce genome assembly, which will produce a reference genome of length 20Gbp); ( ii ) possibility to compute new forensics features; ( iii ) study relationships among features in amore uniform way.Thanks to the feature-by-feature analysis, the FRCurve is often able to express and explain the currentlimitations of diﬀerent assemblers. In many situations it is straight-forward to rank the assemblers simplyby inspecting the FRC curves. Even when the scenario is unclear, FRCurve is still useful to highlightadvantages and disadvantages of one assembler over the other ( e.g. , an assembler that presents goodlong range connectivity but makes many mistakes in the small contigs, versus an assembler that has lowconnectivity but does not present local mis-assemblies). It is important to recall that, currently, none ofthe standard de novo evaluation metrics is able to capture these situations in the absence of a referencesequence.We believe that features-based analysis will guide eﬀorts aimed at de novo assembly evaluation and de novo assembler design. Our results clearly show that FRCurve can easily separate the best assembliesfrom the worst ones. By comparing feature-speciﬁc curves one can evaluate strong and weak pointsof each assembler and choose the system that best ﬁts one’s objective. It is hoped that, in future,assembler-developers will be guided by the features-based analysis to improve these tools — at the coreof the current genomic revolution. Software and Data Availability

The sequencing data used in this study is publicly available on the GAGE website and on the Assem-blathon website (details are available in Supplementary material). FRC bam source code can be down-loaded from https://github.com/vezzi/FRC align.git

Supplementary Material

All supplementary material is available at: ∼ vezzi/publications/supplementary.pdf and http://cs.nyu.edu/mishra/PUBLICATIONS/12.supplementaryFRC.pdf Acknowledgments

We would like to thank all the Spruce Assembly Project, in particular Prof. Lars Arvestad, Bj¨orn Nystedtand Nathaniel Street for their constant feedback and advice. We also wish to thank Mike Schatz of CSHL,Mihai Pop, XXX for many useful comments on an earlier draft of the paper. Moreover, FV would like2to thank Knut and Alice Wallenberg Foundation for their support. The research reported in this paperwas partially supported by an NSF CDI grant.

References

1. Nagarajan N, Pop M (2009) Parametric complexity of sequence assembly: Theory and applicationsto next generation sequencing. Journal of Computational Biology 16: 897–908.2. Miller J, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data.Genomics : 1–13.3. Vezzi F, Narzisi G, Mishra B (2012) Feature-by-Feature Evaluating De Novo Sequence Assembly.PLoS ONE 7: e31002.4. Semple CAM (2003) Assembling a View of the Human Genome, John Wiley & Sons, Ltd. pp.93–117. doi:10.1002/0470867302.ch5. URL http://dx.doi.org/10.1002/0470867302.ch5 .5. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial sequencing andanalysis of the human genome. Nature 409: 860–921.6. Alkan C, Sajjadian S, Eichler E (2010) Limitations of next-generation genome sequence assembly.Nature methods 8: 61–65.7. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform.Bioinformatics 26: 589.8. Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: ﬁnding the elusive mis-assembly. Genome biology 9: R55.9. Narzisi G, Mishra B (2011) Comparing De Novo Genome Assembly: The Long and Short of It.PLoS ONE 6: e19175.10. Earl DA, Bradnam K, St John J, Darling A, Lin D, et al. (2011) Assemblathon 1: A competitiveassessment of de novo short read assembly methods. Genome Research .11. Salzberg SL, Phillippy aM, Zimin aV, Puiu D, Magoc T, et al. (2011) GAGE: A critical evaluationof genome assemblies and assembly algorithms. Genome Research .12. Narzisi G, Mishra B (2010) Scoring-and-Unfolding Trimmed Tree Assembler: Concepts, Constructsand Comparisons. Bioinformatics (Oxford, England) 27: 153–160.13. Zimin AV, Smith DR, Sutton G, Yorke Ja (2008) Assembly reconciliation. Bioinformatics (Oxford,England) 24: 42–5.14. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.Bioinformatics (Oxford, England) 25: 1754–60.15. Zerbino D, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijngraphs. Genome research 18: 821–829.16. Reinhardt J, Baltrus D, Nishimura M, WR (2008) Eﬃcient de novo assembly of bacterial genomesusing low coverage short read sequencing. Genome .17. Brody H, Griﬃth J, Cuticchia AJ, Arnold J, Timberlake WE (1991) Chromosome-speciﬁc recom-binant dna libraries from the fungus aspergillus nidulans. Nucleic Acids Res 19: 3105–3109.318. Anantharaman TS, Mishra B, Schwartz DC (1997) Genomics via optical mapping iii: Contiginggenomic dna and variations (extended abstract). AAAI Press, pp. 18–27.19. Anantharaman TS, Mishra B, Schwartz DC (1997) Genomics via optical mapping ii: Orderedrestriction maps. Journal of Computational Biology 4: 91-118.

Figure Legends a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL

ABySSAllpaths-LGBambus2MSR-CASGASOAPdenovoVelvet (a) Staphylococcus aureus: GAGE entries ,

000 1 ,

100 1 ,

200 1 ,

300 1 ,

400 1 , a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL

ABySSAllpaths-LGBambus2CABOGMSR-CASGASOAPdenovoVelvet (b) Rhodobacter: GAGE entries . . . . . . . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL

ABySSAllpaths-LGBambus2CABOGMSR-CASGASOAPdenovoVelvet (c) Human chromosome 14: GAGE entries . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL

BROADWTSI-SBGIDOEJGICSHLCRACSEBIBCCGSCIoBUGARHULWTSI-PDCSISUIRISAASTRUCSFGACWTCIUoC (d) Assemblathon1 entries

Figure 1.

FRCurve computed on three GAGE datasets and on Assemblathon 1 entries.4

NC_010079gi|161510924|ref|NC_010063.1|gi|225631039|ref|NC_012417.1|0 200000 400000 600000 800000 1000000 1200000 1400000

QRY (a) Staphylococcus aureus vs Allpaths-LG NC_010079gi|161510924|ref|NC_010063.1|gi|225631039|ref|NC_012417.1|0 500000 1000000 1500000 2000000

QRY (b) Staphylococcus aureus vs MSR-CA Figure 2. dotPlots for Staphylococcus: MSR-CA and Allpaths-LG longest contigs have been alignedagainst the reference genome.5

Figure 3.

Dotplot validation of the longest scaﬀold produced by CABOG on Rhodobacter dataset.The green line represents the Features identiﬁed by FRC bam . . . . . . . . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space TOTAL

BCM-HGSCRayCurtainGAMPhusionmeraculoussgaSymbioseABySSCRACSSOAPdenovoPRICE (a) Assemblathon 2 FRCurve: all features . . . . . . . . . . . . . . . . · a pp r o x i m a t e c o v e r ag e ( % ) Feature Space HIGH SPAN PE

BCM-HGSCRayCurtainGAMPhusionmeraculoussgaSymbioseABySSCRACSSOAPdenovoPRICE (b) Assemblathon 2 FRCurve: High spanning PE feature

Figure 4.

FRCurve computed on Assemblathon 2 entries. Figure 4(a) shows FRCurves for all thefeatures, while Figure 4(b) shows the FRCurves plotted on a single feature6

Tables

Feature Description

LOW_COV_PE low read coverage areas (all aligned reads).

HIGH_COV_PE high read coverage areas (all aligned reads).

LOW_NORM_COV_PE low paired-read coverage areas (only properly aligned pairs).

HIGH_NORM_COV_PE high paired-read coverage areas (only properly aligned pairs).

COMPR_PE low CE-statistics computed on PE-reads.

STRECH_PE high CE-statistics computed on PE-reads.

HIGH_SINGLE_PE high number of PE reads with unmapped pair . HIGH_SPAN_PE high number of PE reads with pair mapped in a diﬀerent contig/scaﬀold . HIGH_OUTIE_PE high number of mis-oriented or too distant PE reads . COMPR_MP low CE-statistics computed on MP reads.

STRECH_MP high CE-statistics computed on MP reads.

HIGH_SINGLE_MP high number of MP reads with unmapped pair . HIGH_SPAN_PE high number of MP reads with pair mapped in a diﬀerent contig/scaﬀold . HIGH_OUTIE_PE high number of mis-oriented or too distant MP reads . Table 1.

Description of implemented features. assembler Ctg NG50 Chaﬀ Dupl Comp Indels Misjoins Inv Reloc Sens Spec(Kbp) (%) (%) (%)ABySS 246 34 6.66 23.06 1.05 10 6 4 2 99.25 62.70Allpaths-LG Table 2.

Staphilococcus aureus (GAGE) assembly evaluation and features estimation. For eachassembler we report the number of contigs/scaﬀolds produced (Ctg), the NG50, the percentage of short(Chaﬀ) contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly(all the percentages are computed with respect to the real genome length), the number of long ( i.e. , > assembler Ctg NG50 Chaﬀ Dupl Comp Indels Misjoins Inv Reloc Sens Spec(Kbp) (%) (%) (%)ABySS 1701 9 1.59 9.93 0.49 38 24 2 22 98.92 37.26Allpaths-LG

34 3,192 Table 3.

Rhodobacter sphaeroides (GAGE) assembly evaluation and features estimation. For eachassembler we report the number of contigs/scaﬀolds produced (Ctg), the NG50, the percentage of short(Chaﬀ) contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly(all the percentages are computed with respect to the real genome length), the number of long ( i.e. , > assembler Ctg NG50 Chaﬀ Dupl Comp Indels Misjoins Inv Reloc Sens Spec(Kbp) (%) (%) (%)ABySS 51301 2,1 34.78 0.48 0.44 762

22 15 7

225 81,647

150 90 60 92.13 65.38SOAPdenovo 13501 455 3.09 5.68 3.19 3902 1529 537 992 90.59 73.10Velvet 3565 1,190 4.23 0.08 0.53 4172 9525 4023 5502 91.60 67.55

Table 4.

Human chromosome 14 (GAGE) assembly evaluation and features estimation. For eachassembler we report the number of contigs/scaﬀolds produced (Ctg), the NG50, the percentage of short(Chaﬀ) contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly(all the percentages are computed with respect to the real genome length), the number of long ( i.e. , > assembler Ctg NG50 Chaﬀ Dupl Comp Indels Misjoins Sens Spec(Kbp) (%) (%) (%)BROAD 989

197 95.10 96.55DOEJGI 771 9,073 0.03 0.01 0.84 163

Table 5.

Assemblathon 1 assembly evaluation and features estimation. For each assembler/team wereport the number of contigs/scaﬀolds produced (Ctg), the NG50, the percentage of short (Chaﬀ)contigs, the percentage of duplicated (Dupl) and compressed (Comp) regions in the assembly (all thepercentages are computed with respect to the real genome length), the number of long ( i.e. , >>