A spectral algorithm for fast de novo layout of uncorrected long nanopore reads
MMANUSCRIPT
A spectral algorithm for fast de novo layout ofuncorrected long nanopore reads
Antoine Recanati , Thomas Br ¨uls , , and Alexandre d’Aspremont CNRS & D.I., UMR 8548, ´Ecole Normale Sup ´erieure, Paris, France. Commissariat `a l’Energie Atomique et aux Energies Alternatives, Direction de la RechercheFondamentale, Genoscope, Evry, France. UMR 8030, Centre National de la Recherche Scientifique, Evry, France. Universit ´e Paris-Saclay, Evry, France.
ABSTRACTMotivation:
New long read sequencers promise to transformsequencing and genome assembly by producing reads tens ofkilobases long. However, their high error rate significantly complicatesassembly and requires expensive correction steps to layout the readsusing standard assembly engines.
Results:
We present an original and efficient spectral algorithm tolayout the uncorrected nanopore reads, and its seamless integrationinto a straightforward overlap/layout/consensus (OLC) assemblyscheme. The method is shown to assemble Oxford Nanopore readsfrom several bacterial genomes into good quality ( ∼
99% identity tothe reference) genome-sized contigs, while yielding more fragmentedassemblies from the eukaryotic microbe
Sacharomyces cerevisiae . Availability and implementation: https://github.com/antrec/spectrassembler
Contact: [email protected]
De novo whole genome sequencing seeks to reconstruct an entiregenome from randomly sampled sub-fragments whose order andorientation within the genome are unknown. The genome isoversampled so that all parts are covered multiple times with highprobability.High-throughput sequencing technologies such as Illuminasubstantially reduce sequencing cost at the expense of read length,which is typically a few hundred base pairs long (bp) at best.Yet, de novo assembly is challenged by short reads, as genomescontain repeated sequences resulting in layout degeneracies whenread length is shorter or of the same order than repeat length [Pop,2004].Recent long read sequencing technologies such as PacBio’sSMRT and Oxford Nanopore Technology (ONT) have spurred arenaissance in de novo assembly as they produce reads over 10kbplong [Koren and Phillippy, 2015]. However, their high error rate( ∼ Hybrid techniques combine short and long read technologies:the accurate short reads are mapped onto the long reads, enabling a consensus sequence to be derived for each long read andthus providing low-error long reads (see for example Madouiet al. [2015]). This method was shown to successfully assembleprokaryotic and eukaryotic genomes with PacBio [Koren et al.,2012] and ONT [Goodwin et al., 2015] data.
Hierarchical assembly follows the same mapping and consensus principle but resorts tolong read data only, the rationale being that the consensus sequencederived from all erroneous long reads matching a given position ofthe genome should be accurate provided there is sufficient coverageand sequencing errors are reasonably randomly distributed: for agiven base position on the genome, if 8 out of 50 reads are wrong,the majority vote still yields the correct base. Hierarchical methodsmap long reads against each other and derive, for each read, aconsensus sequence based on all the reads that overlap it. Such anapproach was implemented in HGAP [Chin et al., 2013] to assemblePacBio SMRT data, and more recently by Loman et al. [2015], toachieve complete de novo assembly of
Escherichia coli with ONTdata exclusively.Recently, Li [2016] showed that it is possible to efficientlyperform de novo assembly of noisy long reads in only twosteps, without any dedicated correction procedure: all-vs-all rawread mapping (with minimap) and assembly (with miniasm).The miniasm assembler is inspired by the Celera Assembler andproduces unitigs through the construction of an assembly graph. Itsmain limitation is that it produces a draft whose error rate is of thesame order as the raw reads.Here, we present a new method for computing the layout ofraw nanopore reads, resulting in a simple and computationallyefficient protocol for assembly. It takes as input the all-vs-all overlapinformation ( e.g. from minimap, MHAP [Berlin et al., 2015] orDALIGNER [Myers, 2014]) and outputs a layout of the reads( i.e. their position and orientation in the genome). Like miniasm,we compute an assembly from the all-vs-all raw read mapping,but achieve improved quality through a coverage-based consensusgeneration process, as in nanocorrect [Loman et al., 2015], althoughreads are not corrected individually in our case.The method relies on a simple spectral algorithm akin to Google’sPageRank [Page et al., 1999] with deep theoretical underpinnings,described in § de novo assembly problems.In § a r X i v : . [ q - b i o . GN ] J u l ecanati, Br ¨uls, d’Aspremont §
3, and discusspossible improvements and limitations in § We lay out the reads in two steps. We first sort them by position, i.e., find apermutation π such that read π (1) will be positioned before read π (2) on thegenome. Then, we iteratively assign an exact position (i.e., leftmost basepaircoordinate on the genome) to each read by using the previous read’s positionand the overlap information.The key step is the first one, which we cast as a seriation problem, i.e.we seek to reconstruct a linear order between n elements using unsorted,pairwise similarity information [Atkins et al., 1998; Fogel et al., 2013]. Herethe n elements are the reads, and the similarity information comes from theoverlapper ( e.g. from minimap).The seriation problem is formulated as follows. Given a pairwisesimilarity matrix A ij , and assuming the data has a serial structure, i.e. thatthere exists an order π such that A π ( i ) π ( j ) decreases with | i − j | , seriationseeks to recover this ordering π (see Figure 1 for an illustration). If such anorder π exists, it minimizes the 2-SUM score,2-SUM ( π ) = n (cid:88) i,j =1 A ij ( π ( i ) − π ( j )) , (1)and the seriation problem can be solved as a minimization over the set ofpermutation vectors [Fogel et al., 2013]. In other words, the permutation π should be such that if A ij is high (meaning that i and j have a highsimilarity), then ( π ( i ) − π ( j )) should be low, meaning that the positions π ( i ) and π ( j ) should be close to each other. Conversely, if A ij = 0 , thepositions of i and j in the new order may be far away without affecting thescore.When using seriation to solve genome assembly problems, the similarity A ij measures the overlap between reads i and j . In an ideal setting withconstant read length and no repeated regions, two overlapping reads shouldhave nearby positions on the genome. We therefore expect the order foundby seriation to roughly match the sorting of the positions of the reads.
20 40 60 80 100 120 140 16020406080100120140160
20 40 60 80 100 120 140 16020406080100120140160
Fig. 1: A similarity matrix reordered with the spectral algorithm.The original matrix (left) has values that decrease when movingaway from the diagonal (red : high value, blue : low value). It israndomly permuted (right), and the spectral algorithm will find backthe original ordering.
The problem of finding a permutation over n elements is combinatorial.Still, provided the original data has a serial structure, an exact solution toseriation exists in the noiseless case [Atkins et al., 1998] using spectralclustering, and there exist several convex relaxations allowing explicitconstraints on the solution [Fogel et al., 2013]. The exact solution is directly related to the well-known spectral clusteringalgorithm. Indeed, for any vector x , the objective in (1) reads n (cid:88) i,j =1 A ij ( x i − x j ) = x T L A x, L A = diag ( A ) − A where L A is the Laplacian matrix of A . This means that the 2-SUM problemamounts to min π π T L A π where π is a permutation vector. Roughly speaking, the spectral clusteringapproach to seriation relaxes the constraint “ π is a permutation vector” into“ π is a vector of R n orthogonal to the constant vector = (1 , ..., T ” withfixed norm. The problem then becomes min { T π =0 , (cid:107) π (cid:107) =1 } π T L A π This relaxed problem is an eigenvector problem. Finding the minimumover normalized vectors x yields the eigenvector associated to the smallesteigenvalue of L A , but the smallest eigenvalue, , is associated with theeigenvector , from which we cannot recover any permutation. However,if we restrict x to be orthogonal to , the solution is the second smallesteigenvector, called the Fiedler vector. A permutation is recovered from thiseigenvector by sorting its coefficients: given x = ( x , x , ..., x n ) , thealgorithm outputs a permutation π such that x π (1) ≤ x π (2) ≤ ... ≤ x π ( n ) .This procedure is summarized as Algorithm 1.In fact, [Atkins et al., 1998] showed that under the assumption that A has a serial structure, Algorithm 1 solves the seriation problem exactly, i.e.recovers the order π such that A π ( i ) π ( j ) decreases with | i − j | . This meansthat we solve the read ordering problem by simply solving an extremaleigenvalue problem, which has low complexity (comparable to PrincipalComponent Analysis (PCA)) and is efficient in practice (see SupplementaryFigure S1 and Table S1). Algorithm 1
Spectral ordering
Input:
Connected similarity matrix A ∈ R n × n
1: Compute Laplacian L A = diag ( A ) − A
2: Compute second smallest eigenvector of L A , x ∗
3: Sort the values of x ∗ Output:
Permutation π : x ∗ π (1) ≤ x ∗ π (2) ≤ ... ≤ x ∗ π ( n ) Once the reads are reordered, we can sequentially compute their exactpositions (basepair coordinate of their left end on the genome) andorientation. We assign position 0 and strand “+” to the first read, and usethe overlap information (position of the overlap on each read and mutualorientation) to compute the second read’s position and orientation, etc. Morespecifically, when computing the position and orientation of read i , we usethe information from reads i − , ..., i − c to average the result, where c roughly equals the coverage, as this makes the layout more robust tomisplaced reads. Note that overlappers relying on hashing, such as minimapand MHAP, do not generate alignments but still locate the overlaps on thereads, making this positioning step possible. Thanks to this “polishing”phase, we would still recover the layout if two neighboring reads werepermuted due to consecutive entries of the sorted Fiedler vector being equalup to the eigenvector computation precision, for example. We built a simple assembler using this layout idea and tested its accuracy.It is partly inspired by the nanocorrect pipeline of Loman et al. [2015]in which reads are corrected using multiple alignments of all overlappingreads. These multiple alignments are performed with a Partial Order Aligner(POA) [Lee et al., 2002] multiple-sequence alignment engine. It computesa consensus sequence from the alignment of multiple sequences using a spectral algorithm for fast de novo layout of uncorrected long nanopore reads window 1 window 2 window 3 POA in windows consensus 1 consensus 2 consensus 3consensus (1+2)consensus ((1+2) +3)
Fig. 2: Consensus generation. Given the layout, the genome issliced into overlapping windows, and a consensus is computed ineach window. The final consensus is then obtained by merging theconsensus windows. dynamic programming approach that is efficient when the sequences aresimilar (which is the case if we trim the sequences to align their overlappingparts). Specifically, we used SPOA, a Single Instruction Multiple Dataimplementation of POA developed in Vaser et al. [2016].The key point is that we do not need to perform multiple alignmentusing all reads, since we already have a layout. Instead, we can generatea consensus sequence for, say, the first 3000 bp of the genome by aligningthe parts of the reads that are included in this window with SPOA, and repeatthis step for the reads included in the window comprising the next 3000 bp ofthe genome, etc. In practice, we take consecutive windows that overlap andthen merge them to avoid errors at the edges, as shown in Figure 2. The top ofthe figure displays the layout of the reads broken down into three consecutiveoverlapping windows, with one consensus sequence generated per windowwith SPOA. The final assembly is obtained by iteratively merging thewindow k +1 to the consensus formed by the windows , . . . , k .The computational complexity for aligning N sequences of length L with POA, with an average divergence between sequences (cid:15) , is roughly O ( mNL ) , with m (cid:39) (1 + 2 (cid:15) ) . With of errors, m is close to 1.If each window of size L w contains about C sequences, the complexity ofbuilding the consensus in a window is O ( mCL w ) . We compute L g /L w consensus windows, with L g the length of the genome (or contig), sothe overall complexity of the consensus generation is O ( mCL g L w ) . Wetherefore chose in practice a window size relatively small, but large enoughto prevent mis-assemblies due to noise in the layout, L w = 3 kbp. In practice, we build the similarity matrix A as follows. Given an overlapfound between the i-th and j-th reads, we set A ij equal to the overlap score(or number of matches, given in tenth column of minimap or fourth columnof MHAP output file). Such matrices are sparse: a read overlaps with only a few others (the number of neighbors of a read in the overlap graph roughlyequals the coverage). There is no sparsity requirement for the algorithmto work, however sparsity lowers RAM usage since we store the n × n similarity matrix with about n × C non-zero values, with C the coverage. Insuch cases, the ordered similarity matrix is band diagonal. Fig. 3: Similarity matrix for
E. coli
ONT sequences before (left) andafter (right) thresholding. The positions of the reads were obtainedby mapping to the reference genome with GraphMap [Sovi´c et al.,2016].
Unfortunately, the correctly ordered (sorted by position of the reads onthe backbone sequence) similarity matrix contains outliers outside the maindiagonal band (see Figure 3) that corrupt the ordering. These outliers aretypically caused by either repeated subsequences or sequencing noise (errorin the reads and chimeric reads), although errors in the similarity can alsobe due to hashing approximations made in the overlap algorithm. We usea threshold on the similarity values and on the length of the overlaps toremove them. The error-induced overlaps are typically short and yield alow similarity score ( e.g. , number of shared min-mers), while repeat-inducedoverlaps can be as long as the length of the repeated region. By weighting thesimilarity, the value associated to repeat-induced overlaps can be lowered.Weighting can be done with, e.g ., the --weighted option in MHAP to adda tf-idf style scaling to the MinHash sketch, making repetitive k-mers lesslikely to cause a match between two sequences, or with default parameterswith minimap. In the Supplementary Material, we describe experimentswith real, corrected and simulated reads to assess the characteristics ofsuch overlaps and validate our method. Supplementary Figure S2 showsthat although the overlap scores and lengths are lower for outliers than forinliers on average, the distributions of these quantities intersect. As shownin S3, the experiments indicate that all false-overlaps can be removed witha stringent threshold on the overlap length and score. However, removingall these short or low score overlaps will also remove many true overlaps.For bacterial genomes, the similarity graph can either remain connectedor be broken into several connected components after a threshold-basedoutlier removal, depending on the initial coverage. Figure S3 illustratesthe empirical observation that the coverage needs to be above 60x to keepthe graph connected while removing all outliers. Most outliers can besimilarly removed for real and synthetic data from
S. cerevisiae , althougha few outliers, probably harboring telomeric repeats, remain at the ends ofchromosomes after thresholding.There is thus a tradeoff to be reached depending on how many trueoverlaps one can afford to lose. With sufficient coverage, a stringentthreshold on overlap score and length will remove both repeat-induced anderror-induced overlaps, while still yielding a connected assembly graph.Otherwise, aggressive filtering will break the similarity graph into severalconnected components. In such a case, since the spectral algorithm onlyworks with a connected similarity graph, we compute the layout andconsensus separately in each connected component, resulting in severalcontigs. To set the threshold sufficiently high to remove outliers but smallenough to keep the number of contigs minimal, we used a heuristic based on ecanati, Br ¨uls, d’Aspremont the following empirical observation, illustrated in Supplementary Figure S4.The presence of outliers in the correctly (based on the positions of the reads)ordered band diagonal matrix imparts an increased bandwidth (maximumdistance to the diagonal of non zero entries) on the matrix reordered with thespectral algorithm.We can therefore run the spectral algorithm, check the bandwidth in thereordered matrix, and increase the threshold if the bandwidth appears toolarge (typically larger than twice the coverage).In practice, we chose to set the threshold on the overlap length to3.5kbp, and removed the overlaps with the lowest score [in the first 40%-quantile (respectively 90% and 95%) for C ≤
60X (resp. 60X ≤ C ≤ ≥ i ), the set of itsneighbors in the graph N i = { j : A ij > } . The subgraph represented by A restricted to N i is either connected (there exists a path between any pairof edges), or split into separate connected components. In the latter case, wekeep the overlaps between read i and its neighbor that belong to only one ofthese connected components (the largest one). Algorithm 2
OLC assembly pipeline
Input: n long noisy reads1: Compute overlaps with an overlapper ( e.g. minimap or MHAP)2: Construct similarity matrix S ∈ R n × n from the overlaps3: Remove outliers from S with a threshold on values S ij , on overlaplength, and removal of connecting reads (as explained in § for all Connected component A of S do
5: Reorder A with spectral algorithm (Algorithm 1)6: if bandwidth of A reordered ≥ × Coverage then
7: set higher threshold on A and try again8: end if
9: Compute layout from the ordering found and overlaps10: Partition the length of the contig into small windows11: Compute consensus in each window with SPOA12: Merge consecutive windows with SPOA13: end forOutput:
Contig consensus sequences
We tested this pipeline on ONT and PacBio data. The bacterium
Acinetobacter baylyi ADP1 and the yeast
Saccharomyces cerevisiaeS288C were sequenced at Genoscope with Oxford Nanopore’sMinION device using the R7.3 chemistry, together with anadditional dataset of
S. cerevisiae S288C using the R9 chemistry.Only the 2D high quality reads were used. The
S. cerevisiaeS288C
Acinetobacter baylyi ADP1 sequences will be made availableon https://github.com/antrec/spectrassembler. We also used thefollowing publicly available data: ONT
Escherichia coli by Lomanet al. [2015] (http://bit.ly/loman006 - PCR1 2D pass dataset), and true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) Fig. 4: Ordering of the reads computed with the spectral algorithmvs true ordering (obtained by mapping the reads to the referencegenome with GraphMap) for the
E. coli
ONT dataset. All contigsare artificially displayed on the same plot for compactness. Thereare two equivalent correct orderings for each contig : (1,2,...,n) and(n, n-1, ..., 1), both yielding the same 2-SUM score (1) and leadingto the same consensus sequence (possibly reverse complemented).PacBio
E. coli
K-12 PacBio P6C4, and
S. cerevisiae W303 P4C2 .Their key characteristics are given with the assembly results inTable 1, and read length histograms are given in SupplementaryFigure S5. For each dataset, we also used the reads correctedand trimmed by the Canu pipeline as an additional dataset withlow error-rate. The results on these corrected datasets are given inSupplementary Figures S6 and S7 and Tables S2 and S4. minimap was used to compute overlapsbetween raw reads (we obtained similar results with MHAP andDALIGNER). The similarity matrix preprocessed as detailed inSection2.3 yielded a few connected components for bacterialgenomes. The reads were successfully ordered in each of these, asone can see in Figure 4 for
E. coli , and in Figure S6 for the otherdatasets.
For the
S. cerevisiae genome, thethreshold on similarity had to be set higher than for bacterialgenomes because of a substantially higher number of repetitiveregions and false overlaps, leading to a more fragmented assembly.Most of them are correctly reordered with the spectral algorithm,see Figure 5 and Supplementary Figure S7.
Once the layout was established,the method described above was used to assemble the contigs andgenerate a consensus sequence. For the two bacterial genomes,the first round of layout produced a small number of connectedcomponents, each of them yielding a contig. Sufficient overlap wasleft between the contig sequences to find their layout with a second spectral algorithm for fast de novo layout of uncorrected long nanopore reads true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) Fig. 5: Ordering of the
Saccharomyces cerevisiae
ONT R7.3 readsidentified with the spectral algorithm vs true ordering (obtainedby mapping the reads to the reference genome with GraphMapand concatenating the ordering found in each chromosome). Thedifferent chromosomes are separated by grid lines.iteration of the algorithm and produce a single contig spanning theentire genome. The number of contigs in the yeast assemblies canbe reduced similarly. The fact that the first-pass contigs overlap eventhough they result from breaking the similarity graph into severalconnected components might seem counter-intuitive at first sight.However, note that when cutting an edge A ij results in the creationof two contigs (one containing i and the other j ), the sequencefragment at the origin of the overlap between the two reads is stillthere on both contigs to yield an overlap between them in the seconditeration. Alternatively, we found the following method useful tolink the contigs’ ends: 1. extract the ends of the contig sequences,2. compute their overlap with minimap, 3. propagate the overlapsto the contig sequences, 4. use miniasm with all pre-selectionparameters and thresholds off, to just concatenate the contigs (seeSupplementary Material § We first investigated thequality of the consensus sequences derived in each window.Figures 6 and S8 highlight the correcting effect of the consensus.Supplementary Figure S9 suggests that the error-rate in theconsensus windows depends mainly on the local coverage. Wethen compared our results to those obtained with other long readsassemblers : Miniasm, Canu and Racon [Vaser et al., 2016]. Racontakes a draft assembly, the raw reads, and a mapping of the readsto the draft assembly as input. We used it with the draft assemblyproduced by Miniasm (as done by Vaser et al. [2016]). We labelthis method “Miniasm+Racon” in our results. We also used Raconwith the draft assembly derived by our method (“Spectral+Racon”method), using Minimap to map the raw reads to the draftassemblies before using Racon. A summary of assembly reportsgenerated with DNAdiff [Kurtz et al., 2004] and QUAST [Gurevichet al., 2013] are given in Table 1 and Supplementary Table S3.Briefly, the assemblies displayed between and averageidentity to their reference genome, with errors mostly consisting Fig. 6: Error rate of consensus window sequences, compared to theraw and corrected (with the Canu correction and trimming modules)reads for the
A. baylyi
ONT dataset. The error rates were computedby mapping the sequences to the
A. baylyi reference genome.Histograms for the other datasets are available in SupplementaryFigure S8.in deletions. Misassemblies were rare in reconstructed bacterialgenomes but more frequent in assembled yeast genomes, where theymostly consisted in translocations and relocations caused by eitherdeletions and/or misplaced reads in the layout.
After the first iteration of the bacterialgenome assembly pipeline, overlaps between the first-pass contigswere sufficient to find their layout. It should be anticipated howeverthat not all overlaps might be apparent in some cases, e.g. iftoo many reads were removed during the preprocessing step. Oneattractive option is to use optical mapping [Aston et al., 1999] tolayout the contigs. We had such an optical map available for the
A.baylyi genome, and implemented the algorithm of Nagarajan et al.[2008] to map the contigs to the restriction map, which led to thesame layout as the one identified from our two-round assemblies(data not shown), thus providing a “consistency check” for thelayout. We suggest in Supplementary Figure S10 and Table S5that optical maps could be particularly valuable for the ordering ofcontigs from more structurally complex eukaryotic genomes such as
S. cerevisiae . We have shown that seriation based layout algorithms can besuccessfully applied to de novo genome assembly problems, at leastfor genomes harboring a limited number of repeats.In a similar vein to the recent report about the miniasm assemblyengine [Li, 2016], our work confirms that the layout of long readscan be found without prior error correction, using only overlapinformation generated from raw reads by tools such as minimap,MHAP or DALIGNER. However, unlike miniasm, which does ecanati, Br ¨uls, d’Aspremont Table 1.
Assembly results of the spectral method, compared to Miniasm, Canu and Racon, across the different datasetsMiniasm Spectral Canu Miniasm+Racon Miniasm+Racon(2 iter.) Spectral+Racon
A. baylyi
ONTR7.328x Ref. genome size [bp] 3598621 3598621 3598621 3598621 3598621 3598621Total bases [bp] 3531295 3551582 3513432 3564823 3566438 3551094Ref. chromosomes [
E. coli
ONTR7.330x Ref. genome size [bp] 4641652 4641652 4641652 4641652 4641652 4641652Total bases [bp] 4759346 4662043 4625543 4647066 4643235 4629112Ref. chromosomes [
S.cerevisiae
ONTR7.368x Ref. genome size [bp] 12157105 12157105 12157105 12157105 12157105 12157105Total bases [bp] 11813544 12213218 12142953 11926664 11926191 12167363Ref. chromosomes [
S.cerevisiae
ONT R986x Ref. genome size [bp] 12157105 12157105 12157105 12157105 12157105 12157105Total bases [bp] 11734150 11795644 12217497 12128279 12129086 11750114Ref. chromosomes [
E. coli
PacBio161x Ref. genome size [bp] 4641652 4641652 4641652 4641652 4641652 4641652Total bases [bp] 4845211 4731239 4670125 4653228 4645420 4674460Ref. chromosomes [
S.cerevisiae
PacBio127x Ref. genome size [bp] 12157105 12157105 12157105 12157105 12157105 12157105Total bases [bp] 12266420 12839034 12346258 12070971 12052148 12695031Ref. chromosomes [
For the spectral method, we give the results after contig merging (see § not derive a consensus but instead concatenates the reads into afull sequence, we take advantage of read coverage to producecontigs with a consensus quality on par with that achieved byassembly pipelines executing dedicated error-correction steps. The results of Table 1 appear promising. For example, our assemblercombined with Racon yields among the highest average identitieswith the reference for the ONT datasets. In terms of speedhowever, our pipeline is clearly outperformed by Miniasm, but also spectral algorithm for fast de novo layout of uncorrected long nanopore reads by Miniasm+Racon, the latter improving overall accuracy. Still,compared to approaches implementing error correction steps, wegain significant speed-ups by highly localizing the error correctionand consensus generation processes, which is made possible byknowledge of the layout. We believe that tools such as Miniasmand Racon are implemented in a much more efficient way than ourown, but the layout method itself is efficient (see SupplementaryTable S1) and is known to be scalable as it relies on the samealgorithmic core as Google’s PageRank.The main limitation of our layout algorithm is its sensitivity tooutliers in the similarity matrix, hence the need to remove them ina pre-processing phase. Higher coverage and quality of the inputreads, both expected in the near future, would likely improve therobustness of our pipeline. Still, for eukaryotic genomes, we foundthat some outliers require additional information to be resolved (seeSupplementary FigureS3), which could be provided in the future byextracting topological information from the assembly graph.In the meantime, our pipeline behaves like a draft generatingassembler for prokaryotic genomes, and a first-pass unitigger foreukaryotic genomes. Importantly, the overall approach is modularand can integrate other algorithms to increase layout robustness orconsensus quality, as illustrated here by the integration of Racon asan optional polishing module.Our original contribution here consists in the layout computation.The spectral OLC assembler we built on top of it could be enhancedin many ways. We have shown that the spectral algorithm is suited tofind the layout for bacterial genomes, even though there is room leftfor performance improvements on repeat-rich eukaryotic genomes.For these eukaryotic genomes, it could make sense to usethe spectral algorithm jointly with other assembly engines ( e.g. Miniasm or Canu), to check the consistency of connectedcomponents before they are assembled. Our consensus generationmethod is coarse-grained for now and does not take into accountstatistical properties of ONT sequencing errors. Nevertheless, thethree components (O, L and C) of the method being independent,an external and more refined consensus generation process couldreadily be plugged after the overlap and layout computations tofurther improve results and increase accuracy.
ACKNOWLEDGEMENT
TB would like to thank Genoscope’s sequencing (Laboratoirede S´equenc¸age) and bioinformatics (Laboratoire d’InformatiqueScientifique) teams for sharing some
Acinetobacter baylyi
ADP1and
Sacharomyces cerevisiae
S288C MinION data, and is gratefulto Oxford Nanopore Technologies Ltd for granting Genoscopeaccess to its MinION device via the MinION Access Programme.AA and AR would like to acknowledge support from theEuropean Research Council (project SIPA). The authors would alsolike to acknowledge support from the chaire ´Economie des nouvellesdonn´ees , the data science joint research initiative with the fondsAXA pour la recherche and a gift from Soci´et´e G´en´erale Cross AssetQuantitative Research.
REFERENCES
Aston, C., Mishra, B., and Schwartz, D. C. (1999). Optical mapping and its potentialfor large-scale sequencing projects.
Trends in Biotechnology , 17(7):297–302. Atkins, J. E., Boman, E. G., and Hendrickson, B. (1998). A spectral algorithmfor seriation and the consecutive ones problem.
SIAM Journal on Computing ,28(1):297–310.Atkins, J. E. and Middendorf, M. (1996). On physical mapping and the consecutiveones property for sparse matrices.
Discrete Appl. Math. , 71(1-3):23–40.Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M., and Phillippy, A. M.(2015). Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
Nature biotechnology .Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A freshapproach to numerical computing.
SIAM Review , 59(1):65–98.Cheema, J., Ellis, T. N., and Dicks, J. (2010). Thread mapper studio: a novel, visual webserver for the estimation of genetic linkage maps.
Nucleic acids research , 38(suppl2):W188–W193.Chin, C.-S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum,A., Copeland, A., Huddleston, J., and Eichler, E. E. (2013). Nonhybrid, finishedmicrobial genome assemblies from long-read smrt sequencing data.
Nature methods ,10(6):563–569.Fogel, F., Jenatton, R., Bach, F., and d’Aspremont, A. (2013). Convex relaxations forpermutation problems. pages 1016–1024.Goodwin, S., Gurtowski, J., Ethe-Sayers, S., Deshpande, P., Schatz, M. C., andMcCombie, W. R. (2015). Oxford nanopore sequencing, hybrid error correction, andde novo assembly of a eukaryotic genome.
Genome research , 25(11):1750–1756.Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). Quast: quality assessmenttool for genome assemblies.
Bioinformatics , 29(8):1072–1075.Jones, B. R., Rajaraman, A., Tannier, E., and Chauve, C. (2012). Anges: reconstructingancestral genomes maps.
Bioinformatics , 28(18):2388.Koren, S. and Phillippy, A. M. (2015). One chromosome, one contig: completemicrobial genomes from long-read sequencing and assembly.
Current Opinion inMicrobiology , 23:110–120.Koren, S., Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G.,Wang, Z., Rasko, D. A., McCombie, W. R., and Jarvis, E. D. (2012). Hybriderror correction and de novo assembly of single-molecule sequencing reads.
Naturebiotechnology , 30(7):693–700.Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., and Phillippy, A. M. (2016). Canu:scalable and accurate long-read assembly via adaptive k-mer weighting and repeatseparation. bioRxiv .Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., andSalzberg, S. L. (2004). Versatile and open software for comparing large genomes.
Genome biology , 5(2):R12.Lee, C., Grasso, C., and Sharlow, M. F. (2002). Multiple sequence alignment usingpartial order graphs.
Bioinformatics , 18(3):452–464.Li, H. (2016). Minimap and miniasm: fast mapping and de novo assembly for noisylong sequences.
Bioinformatics , page btw152.Loman, N. J., Quick, J., and Simpson, J. T. (2015). A complete bacterial genomeassembled de novo using only nanopore sequencing data.
Nat Meth , 12(8):733–735.Madoui, M.-A., Engelen, S., Cruaud, C., Belser, C., Bertrand, L., Alberti, A.,Lemainque, A., Wincker, P., and Aury, J.-M. (2015). Genome assembly usingnanopore-guided long and error-free dna reads.
BMC Genomics , 16:327.Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J.,Kravitz, S. A., Mobarry, C. M., Reinert, K. H., and Remington, K. A. (2000). Awhole-genome assembly of drosophila.
Science , 287(5461):2196–2204.Myers, G. (2014).
Efficient local alignment discovery amongst noisy long reads , pages52–67. Springer.Nagarajan, N., Read, T. D., and Pop, M. (2008). Scaffolding and validation of bacterialgenome assemblies using optical restriction maps.
Bioinformatics , 24(10):1229–1235.Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The pagerank citationranking: Bringing order to the web. Technical report, Stanford InfoLab.Pop, M. (2004). Shotgun sequence assembly.
Advances in computers , 60:193–248.Sovi´c, I., ˇSiki´c, M., Wilm, A., Fenlon, S. N., Chen, S., and Nagarajan, N. (2016).Fast and sensitive mapping of nanopore sequencing reads with graphmap.
Naturecommunications , 7.Vaser, R., Sovic, I., Nagarajan, N., and Sikic, M. (2016). Fast and accurate de novogenome assembly from long uncorrected reads. bioRxiv , page 068122.Yang, C., Chu, J., Warren, R. L., and Birol, I. (2016). Nanosim: nanopore sequenceread simulator based on statistical characterization. bioRxiv , page 044545. spectral algorithm for fast de novo layout of uncorrected long nanopore reads Table S1.
Running time for the different methods on the datasets presented in Section3.1
Spectral Layout Spectral (full,+Minimap) Canu Minimap +Miniasm Racon afterMiniasm Racon afterSpectral
A. baylyi
ONTR7.3 28x Runtime [h:mm:ss] 0:00:23 (0:00:59) 0:12:52 0:25:55 0:00:28 0:01:54 0:01:48Max mem [Gb] 1.966 1.966 3.827 1.499 0.756 0.484
E. coli
ONTR7.3 30x Runtime [h:mm:ss] 0:00:41 (0:01:25) 0:16:15 0:28:40 0:00:13 0:04:36 0:02:14Max mem [Gb] 1.216 1.216 4.655 2.099 0.879 0.645
S. cerevisiae
ONT R7.3 68x Runtime [h:mm:ss] 0:01:41 (0:07:60) 1:41:20 4:33:08 0:01:17 0:21:11 0:21:32Max mem [Gb] 12.208 12.208 4.015 8.506 2.376 2.325
S. cerevisiae
ONT R9 86x Runtime [h:mm:ss] 0:03:38 (0:09:28) 2:26:44 7:15:41 0:02:14 0:23:09 0:22:03Max mem [Gb] 32.928 32.928 3.986 12.397 2.966 2.775
E. coli
PacBio161x Runtime [h:mm:ss] 0:05:19 (0:05:44) 1:32:13 0:51:32 0:01:16 0:16:51 0:18:18Max mem [Gb] 21.650 21.650 3.770 9.969 8.082 4.619
S. cerevisiae
PacBio 127x Runtime [h:mm:ss] 0:03:11 (0:07:01) 2:59:41 1:50:23 0:02:10 0:20:54 0:23:32Max mem [Gb] 32.184 32.184 3.810 16.881 4.290 4.307
Run-time and peak memory for the previously compared methods, when run on a 24 cores Intel Xeon E5-2640 2.50GHz node. Runtime and Max mem correspond to thewall-clock and maximum resident set size fields of the unix /usr/bin/time -v command. The first column (Spectral Layout) displays the running time of the layout phase of ourmethod in the following way: time to reorder contigs with the spectral algorithm (total time to get fine-grained layout); the total time for the layout (including the fine-grainedcomputation of the position of the reads on a backbone sequence) is given between parentheses next to the time for the ordering. The second column gives the runtime forour full pipeline, including running minimap to obtain the overlaps. The runtime for Racon includes the time to map the reads to the backbone sequence with Minimap and torun Racon for the consensus (Racon requires a backbone sequence, obtained either with Miniasm or Spectral in the present experiments). Indeed, the Racon pipeline maps thereads to a draft sequence to get the layout and then computes consensus sequences in windows across the genome. Our pipeline instead directly computes the layout and thengenerates consensus sequences in windows across the genome (the latter task being embarassingly parallel). Canu is faster than our method on the PacBio datasets (probablyat least because because we did not adapt our pipeline (as Canu does) to the much higher coverage, nor to the higher fraction of chimeric reads typical of PacBio data), but noton the ONT datasets. The memory for the spectral method can be allocated among several cores. matrix size (conn. comp.) t i m e t o r e o r d e r ( s ) pythonjulia Fig. S1: Runtime of the spectral ordering algorithm in connected components of different sizes (across all datasets), with two solvers for theeigenvalues computations (scipy.sparse.eigsh and the eigs function from Julia [Bezanson et al., 2017]). We implemented a call to Julia formatrices of size larger than 3000 in the code since its eigenvector computation scales better for large matrices (probably due to the fact thatthe matrices are passed by reference in Julia) but has a non-negligible overhead for small matrices. ecanati, Br ¨uls, d’Aspremont D e n s i t y OutliersInliers (a)
A. baylyi
ONT D e n s i t y OutliersInliers (b)
S. cerevisiae
ONT R7.3 D e n s i t y OutliersInliers (c)
S. cerevisiae
ONT R9 Ovl length D e n s i t y OutliersInliers (d)
A. baylyi
ONT Ovl length D e n s i t y OutliersInliers (e)
S. cerevisiae
ONT R7.3 Ovl length D e n s i t y OutliersInliers (f)
S. cerevisiae
ONT R9Fig. S2: Histograms of overlap scores [number of matches from minimap] (a-c) and overlap lengths (d-f) for the ONT datasets, for outliers(blue) and inliers (green). The x-axis is in log scale. The mapping of the reads to the reference genome with GraphMap was used to labelinliers and outliers. spectral algorithm for fast de novo layout of uncorrected long nanopore reads (a) A. baylyi simu. perfect 104X (b)
A. baylyi simu. raw 104X (c)
S. cerevisiae simu. perfect 87x(d)
A. baylyi sim. perfect 104X (e)
A. baylyi simu. raw 104X (f)
S. cerevisiae simu. perfect 87xFig. S3: Ordered similarity matrices for simulated datasets after removing 50% of the overlaps (a-c) or 90% (d-f) to illustrate the outlierremoval by thresholding on the overlap score. The reads were simulated with NanoSim [Yang et al., 2016], from the
A. baylyi
ONT R7.3 and
S. cerevisiae
ONT R9 datasets. Subfigures S3a and S3d (respectively S3c and S3f) represent the similarity for reads generated with NanoSimfrom the
A. baylyi
ONT R7.3 (respectively
S. cerevisiae
ONT R9) dataset with option –perfect, which means these synthetic reads follow thesame length distribution than the original dataset, but have no errors, and have the coverage specified above. The matrices S3b and S3e weregenerated from the
A. baylyi
ONT R7.3 dataset without the –perfect option, which means they have the same length and error distributionthan the original data, but with higher coverage. For perfect and noisy synthetic
A. baylyi reads and with sufficient coverage, all outlierscould be removed by thresholding while keeping a connected similarity graph (all matrices in the Figure are connected). On the other hand,the similarity matrix generated with
S. cerevisiae perfect reads still harbors a few outliers after removing 90% of the overlaps (with lowestscore). When increasing the threshold value, the connectivity within some individual chromosomes will be broken before all outliers havebeen removed. Additional structural information (as used in Canu or Miniasm) will be required to resolve repeats in such situations. ecanati, Br ¨uls, d’Aspremont (a) bandwidth : 1806 (b)Fig. S4: Similarity matrices containing outliers, displayed with true ordering (obtained by mapping the reads to the reference genome withGraphMap) and generated with a subset of A. baylyi
ONT NanoSim perfect reads S4a, and the same matrix incorrectly reordered with thespectral algorithm S4b. The bandwidth is about 50 times as large as in the absence of outliers. This significant gap (an order of magnitudedifference) between the bandwidth of the matrix reordered with the spectral algorithm depending on whether the original matrix (ordered byincreasing position of the reads) contained outliers ( i.e. , is band-diagonal) or not motivated the development of the heuristic for assessing theordering found by the spectral algorithm, as explained in § read length (Kbp) d e n s i t y (a) A. baylyi
ONT read length (Kbp) d e n s i t y (b) E. coli
ONT read length (Kbp) d e n s i t y (c) E. coli
PacBio read length (Kbp) d e n s i t y (d) S. cerevisiae
ONT R7.3 read length (Kbp) d e n s i t y (e) S. cerevisiae
ONT R9 read length (Kbp) d e n s i t y (f) S. cerevisiae
PacBioFig. S5: Read length histograms of the raw datasets. spectral algorithm for fast de novo layout of uncorrected long nanopore reads Table S2.
Assembly results of several assemblers across the datasets corrected with CanuMiniasm Spectral Canu Miniasm+Racon Miniasm+Racon(2 iter.) Spectral+Racon
A. baylyi
ONTR7.328x(26x) Ref. genome size [bp] 3598621 3598621 3598621 3598621 3598621 3598621Total bases [bp] 3493724 3523055 3516777 3540178 3540766 3522315Ref. chromosomes [
E. coli
ONTR7.330x(27x) Ref. genome size [bp] 4641652 4641652 4641652 4641652 4641652 4641652Total bases [bp] 4597538 4613973 4627578 4617120 4617100 4613521Ref. chromosomes [
S.cerevisiae
ONTR7.368x(38x) Ref. genome size [bp] 12157105 12157105 12157105 12157105 12157105 12157105Total bases [bp] 11814836 11959669 12112186 11877015 11876882 11949674Ref. chromosomes [
S.cerevisiae
ONTR9 86x(40x) Ref. genome size [bp] 12157105 12157105 12157105 12157105 12157105 12157105Total bases [bp] 11946760 12081487 12184545 11970672 11970529 12061759Ref. chromosomes [
E. coli
PacBio161x(38x) Ref. genome size [bp] 4641652 4641652 4641652 4641652 4641652 4641652Total bases [bp] 4642736 4663427 4670125 4642423 4642443 4662179Ref. chromosomes [
S.cerevisiae
PacBio127x(37x) Ref. genome size [bp] 12157105 12157105 12157105 12157105 12157105 12157105Total bases [bp] 12174558 12232964 12346261 12194786 12193481 12217702Ref. chromosomes [
These corrected datasets were obtained by running Canu with the saveReadCorrections=True option on the datasets presented in 3.1. Canu includes correction and trimming, resultingin a removal of short reads and a lower coverage than in the original raw data. However, it is the coverage of the raw dataset which is relevant since higher coverage in the latter willresult in longer reads in the corrected data, even though the coverage in all corrected datasets are roughly below 40x. We indicate the coverage of the corrected datasets in parenthesesnext to the coverage of the original dataset. For the spectral method, we give the results after the contig merging step (see 3.3.1). The number of contigs before this post-processingis given between parentheses. Unlike with raw data, the polishing effect of adding Racon to our pipeline is not significant. All methods have comparable results on the correcteddatasets. The best result in terms of average identity only is indicated in bold (but other metrics should also be used to compare the assemblies). ecanati, Br ¨uls, d’Aspremont true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (a) A. baylyi
ONT true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (b) E. coli
ONT true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (c) E. coli
PacBio true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (d) A. baylyi
ONT corr. true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (e) E. coli
ONT corr. true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (f) E. coli
PacBio corr.Fig. S6: Ordering of the reads computed with the spectral algorithm vs true ordering (obtained by mapping the reads to the referencegenome with GraphMap) for the original (a-c) and corrected (d-f) bacterial datasets. All contigs are artificially displayed on the same plotfor compactness. true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (a) ONT R7.3 true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (b) ONT R9 true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (c) PacBio true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (d) ONT R7.3 corr. true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (e) ONT R9 corr. true position . M bp p o s i t i o n f o un d ( s p e c t r a l a l g o . ) (f) PacBio corr.Fig. S7: Ordering of the reads computed with the spectral algorithm vs true ordering (obtained by mapping the reads to the reference genomewith GraphMap) for the original (a-c) and corrected (d-f) yeast ( S. cerevisiae ) datasets. All contigs are artificially displayed on the same plotfor compactness. The dashed lines represent the boundaries between chromosomes. The correction slightly improves the layout for the yeastgenomes. spectral algorithm for fast de novo layout of uncorrected long nanopore reads (a) A. baylyi
ONT
Error rate (%) D e n s i t y Raw reads(Canu) corrected readsConsensus in windows (b)
E. coli
ONT
Error rate (%) D e n s i t y Raw reads(Canu) corrected readsConsensus in windows (c)
E. coli
PacBio
Error rate (%) D e n s i t y Raw reads(Canu) corrected readsConsensus in windows (d)
S. cerevisiae
ONT R7.3
Error rate (%) D e n s i t y Raw reads(Canu) corrected readsConsensus in windows (e)
S. cerevisiae
ONT R9
Error rate (%) D e n s i t y Raw reads(Canu) corrected readsConsensus in windows (f)
S. cerevisiae
PacBioFig. S8: Error-rates in consensus windows, raw reads and corrected reads for the six real datasets. With ONT R7.3 data, the consensusproduced by our pipeline appears more accurate than via the correction module of Canu, while the contrary is true for PacBio data. ecanati, Br ¨uls, d’Aspremont window position (found by mapping to ref.) e rr o r r a t e o f c o n s e n s u s w i n d o w ( % ) (a) A. baylyi
ONT window position (found by mapping to ref.) e rr o r r a t e o f c o n s e n s u s w i n d o w ( % ) (b) S. cerevisiae
ONT R7.3
10 20 30 40 50 60 70 80 coverage in window for consensus e rr o r r a t e o f c o n s e n s u s w i n d o w ( % ) (c) A. baylyi
ONT
10 20 30 40 50 60 coverage in window for consensus e rr o r r a t e o f c o n s e n s u s w i n d o w ( % ) (d) S. cerevisiae
ONT R7.3Fig. S9: Error-rates in consensus windows versus position of the windows on the reference genome (a,b). The dashed lines represent thelocation of repeats for
A. baylyi , and the separation between chromosomes for
S. cerevisiae . The size of each scatter marker is proportionalto the coverage of the window. The (c,d) panel represents the error-rates in consensus windows versus the coverage of the windows. Theerror-rate was computed with the errorrates.py script from samtools, using the mapping obtained from GraphMap. Most of the windowswith a high error rate are positioned at the ends of the contigs they belong to. We also observed that repeats are often positioned at the edgebetween two contigs, though this does not seem to be the determinant factor. The bottom plots represent the error-rate in the windows againsttheir estimated coverage, defined as the total length of sequences used to perform the multiple alignment in the window normalized by thelength of the consensus sequence. Overall, one can see that the windows with high error rate are the ones with low coverage. Nevertheless,especially for the yeast genomes, there are also several windows with high values for both error-rate and coverage. Manual inspection ofthese reveals that they usually do not span repeated regions, but their high error-rates arise from imperfections in the layout. spectral algorithm for fast de novo layout of uncorrected long nanopore reads Table S3.
Misassemblies report of the different assemblers across the various datasetsMiniasm Spectral Canu Miniasm+Racon Miniasm+Racon(2 iter.) Spectral+Racon
A. baylyi
ONTR7.328x Relocations [
E. coli
ONTR7.330x Relocations [
S.cerevisiae
ONTR7.368x Relocations [
S.cerevisiae
ONT R986x Relocations [
E. coli
PacBio161x Relocations [
S.cerevisiae
PacBio127x Relocations [
This report was obtained with QUAST [Gurevich et al., 2013] (only a subset of the report is shown). Given the accuracy of the Miniasm assembly, it is likely that the zeros in theMiniasm column are due to the fact that the algorithm failed to correctly match the sequences, rather than the absence of misassemblies. On all ONT datasets, the Spectral andSpectral+Racon methods are among those yielding the least global misassemblies (relocation, translocation or inversions). ecanati, Br ¨uls, d’Aspremont Table S4.
Misassemblies report of the different assemblers across the datasets corrected with CanuMiniasm Spectral Canu Miniasm+Racon Miniasm+Racon(2 iter.) Spectral+Racon
A. baylyi
ONTR7.328x(26x) Relocations [
E. coli
ONTR7.330x(27x) Relocations [
S.cerevisiae
ONTR7.368x(38x) Relocations [
S.cerevisiae
ONTR9 86x(40x) Relocations [
E. coli
PacBio161x(38x) Relocations [
S.cerevisiae
PacBio127x(37x) Relocations [
This report was obtained with QUAST (only a subset of the report is shown). The number of local misassemblies is smaller than with the uncorrected data, but the number ofglobal ones is not. None of the assemblers has a significantly smaller or larger number of misassemblies compared to the others. spectral algorithm for fast de novo layout of uncorrected long nanopore reads Table S5.
Assembly of each chromosome of
S. cerevisiae (for each chromosome, we used the subset of reads from the
S. cerevisiae
ONT R7.3 dataset thatwere mapped to it). The assembled contigs were evaluated with QUAST and DNAdiff for each chromosome (only a subset of the QUAST descriptive statisticsis shown here). This experiments sheds light on how our method would behave if there were no repeats between chromosomes, or if we knew to whichchromosomes some reads belong to thanks to, e.g., optical mapping.Chr. Ref size [bp] Contigs [ number of RS nu m b e r o f c o n t i g s correctly mappedmis-mapped (a) A. baylyi
ONT contig length (kb) nu m b e r o f c o n t i g s correctly mappedmis-mapped (b) S. cerevisiae
ONT R7.3Fig. S10: Result of an experiment to evaluate the extent to which optical mapping could improve long-range anchoring of the 127
S. cerevisiae
ONT R7.3 contigs and provide an alternative consistency check of the assembly. A restriction map was generated in silico from the reference
S. cerevisiae genome with the BamHI restriction site (GGATCC), yielding one map per chromosome. This simulated optical map representsa best-case scenario since real optical measurements lack some precision and are obtained through an error-prone assembly process. Usingthe same algorithm as for the
A. baylyi genome, we obtained this bar plot showing the number of contigs as a function of the number ofdistinct restriction sites (RS) in their sequence (a) or contig length (b). For a given number of RS occurrences (a) or contig length (b), theblue part of the bar shows the fraction of contigs correctly aligned to the theoretical restriction map, whereas the red part corresponds to thecomplementary fraction of unperfectly aligned contigs. All contigs longer than 60kbp are correctly mapped. ecanati, Br ¨uls, d’Aspremont Implementation and reproducibility
Spectrassembler is implemented in python and available on https://github.com/antrec/spectrassembler with a usage example of how toreproduce the results obtained with
E. coli
ONT data. We used the following software : • SPOA - https://github.com/rvaser/spoa - commit b29e10ba822c2c47dfddf3865bc6a6fea2c3d69b • Minimap - https://github.com/lh3/minimap - commit 1cd6ae3bc7c7a6f9e7c03c0b7a93a12647bba244 • Miniasm - https://github.com/lh3/miniasm - commit 17d5bd12290e0e8a48a5df5afaeaef4d171aa133 • Canu v1.4 - https://github.com/marbl/canu - commit r8037 4ece307bc793c3bc61628526429c224c477c2224 • Racon - https://github.com/isovic/racon - commit e55bb714ef534ae6d076ff657581836f324e0776 • MUMmer’s DNAdiff version 1.2, NUCmer version 3.07 - http://mummer.sourceforge.net/ • QUAST - https://sourceforge.net/projects/quast/files/ • GraphMap - https://github.com/isovic/GraphMap - commit 84f058f92dc5be02022e944dd1d6b9414476432a • errorrates.py from samscripts - https://github.com/isovic/samscripts - commit cd7440fbbffafd76f40b15973c93acbe6111265a • NanoSim - https://github.com/bcgsc/NanoSim - commit 48b9a4c3fcaeff623b9207b7db6d6d88b89a5647SPOA is used in our pipeline for performing multiple sequence alignment. For generating the consensus in windows, it was run withthe options : -l 2 -r 0 -x -3 -o -5 -e -2 (semi-global alignment with custom gap and mismatch penalties). minimap wasrun with options -Sw5 -L100 -m0 -t12 (long reads specific values and multithreading with 12 threads). miniasm was run withdefault parameters when used as a comparative method. Canu was run with saveReadCorrections=True option and data specifications (e.g., genomeSize=3.6m -nanopore-raw ). Racon was run with the alignment generated with minimap (to map the draft assembly, eitherfrom miniasm or from our pipeline) with default parameters. GraphMap [Sovi´c et al., 2016] was used to generate alignment between the readsand the reference genome in order to have the position of the reads and their error rate (which was computed with the script errorrates.py).DNAdiff and QUAST were used to evaluate the assemblies. To concatenate the contigs obtained with our method, we extracted their ends(end length used : 35kbp) and used minimap with options -Sw5 -L500 to compute overlaps between them, and ran miniasm with options -1-2 -e 0 -c 0 -r 1,0 (no pre-selection, no cutting small unitigs, no overlap drop). The related script is available in the tools folder of our GitHubcode. We also publish the other scripts we used (although they may be poorly written and undocumented), including our implementation ofthe optical mapping algorithm of Nagarajan et al. [2008], in the tools folder.to compute overlaps between them, and ran miniasm with options -1-2 -e 0 -c 0 -r 1,0 (no pre-selection, no cutting small unitigs, no overlap drop). The related script is available in the tools folder of our GitHubcode. We also publish the other scripts we used (although they may be poorly written and undocumented), including our implementation ofthe optical mapping algorithm of Nagarajan et al. [2008], in the tools folder.