[PDF] Accurate, Fast and Lightweight Clustering of de novo Transcriptomes using Fragment Equivalence Classes

Abstract

Motivation: De novo transcriptome assembly of non-model organisms is the first major step for many RNA-seq analysis tasks. Current methods for de novo assembly often report a large number of contiguous sequences (contigs), which may be fractured and incomplete sequences instead of full-length transcripts. Dealing with a large number of such contigs can slow and complicate downstream analysis. Results :We present a method for clustering contigs from de novo transcriptome assemblies based upon the relationships exposed by multi-mapping sequencing fragments. Specifically, we cast the problem of clustering contigs as one of clustering a sparse graph that is induced by equivalence classes of fragments that map to subsets of the transcriptome. Leveraging recent developments in efficient read mapping and transcript quantification, we have developed RapClust, a tool implementing this approach that is capable of accurately clustering most large de novo transcriptomes in a matter of minutes, while simultaneously providing accurate estimates of expression for the resulting clusters. We compare RapClust against a number of tools commonly used for de novo transcriptome clustering. Using de novo assemblies of organisms for which reference genomes are available, we assess the accuracy of these different methods in terms of the quality of the resulting clusterings, and the concordance of differential expression tests with those based on ground truth clusters. We find that RapClust produces clusters of comparable or better quality than existing state-of-the-art approaches, and does so substantially faster. RapClust also confers a large benefit in terms of space usage, as it produces only succinct intermediate files - usually on the order of a few megabytes - even when processing hundreds of millions of reads.

Full PDF

AAccurate, Fast and Lightweight Clustering of de novo

Transcriptomes using Fragment Equivalence Classes

Avi Srivastava ∗ , Hirak Sarkar ∗ , Laraib Malik , Rob Patro Department of Computer Science, Stony Brook University Stony Brook, NY 11794-2424

AbstractMotivation : de novo transcriptome assembly of non-model organisms is the ﬁrst major step for manyRNA-seq analysis tasks. Current methods for de novo assembly often report a large number of con-tiguous sequences (contigs), which may be fractured and incomplete sequences instead of full-lengthtranscripts. Dealing with a large number of such contigs can slow and complicate downstream analysis. Results : We present a method for clustering contigs from de novo transcriptome assemblies based uponthe relationships exposed by multi-mapping sequencing fragments. Speciﬁcally, we cast the problem ofclustering contigs as one of clustering a sparse graph that is induced by equivalence classes of fragmentsthat map to subsets of the transcriptome. Leveraging recent developments in efﬁcient read mapping andtranscript quantiﬁcation, we have developed

RapClust , a tool implementing this approach that is capableof accurately clustering most large de novo transcriptomes in a matter of minutes, while simultaneouslyproviding accurate estimates of expression for the resulting clusters. We compare

RapClust againsta number of tools commonly used for de novo transcriptome clustering. Using de novo assemblies oforganisms for which reference genomes are available, we assess the accuracy of these different methodsin terms of the quality of the resulting clusterings, and the concordance of differential expression testswith those based on ground truth clusters. We ﬁnd that

RapClust produces clusters of comparable orbetter quality than existing state-of-the-art approaches, and does so substantially faster.

RapClust alsoconfers a large beneﬁt in terms of space usage, as it produces only succinct intermediate ﬁles — usuallyon the order of a few megabytes — even when processing hundreds of millions of reads.

Availability : RapClust , is implemented in Python, and is available as open source software at https://github.com/COMBINE-lab/RapClust under a BSD with attribution license.

Contact : [email protected]

RNA-seq has proven to be a crucial tool in the study of transcriptomes, allowing scientists to probe dif-ferential expression under various conditions, like response to changing stimuli (27) and disease states (25).It also allows for the discovery of novel alternative splicing events within annotated genes as well as thediscovery of previously unannotated genes. One of the particularly useful properties of high-throughputRNA-seq data is that it can be used to directly assemble the transcriptome of organisms, even when a refer-ence genome is not available. de novo transcriptome assembly solves this problem by assembling RNA-seqreads directly into contigs. Popular assemblers, like Trinity (7), Oasis (21) and Trans-ABySS (20), employa wide range of algorithms and heuristics to provide contigs that approximate the full-length transcriptspresent in the underlying sample. Given the cost and difﬁculty that may be required for genomic assembly, de novo transcriptome assembly offers a compelling alternative when one is interested primarily in studyingan organism’s transcriptome.Despite the advanced techniques employed by many modern de novo transcriptome assemblers, the re-sulting assemblies often contain a large number of contigs which do not represent full-length transcripts.These incomplete contigs may result from fractured assemblies, incomplete assemblies due to lack of cover-age, errant assembly of chimeric transcripts, or a host of other errors (22). Though improved computational ∗ These authors contributed equally to this work. a r X i v : . [ q - b i o . GN ] A p r ethods can reduce the prevalence of such errors, the data itself is often insufﬁcient to guarantee determin-istic recovery of all expressed transcripts. The fractured and incomplete nature of such de novo assembliescan confound downstream analysis. For example, Davidson and Oshlack (5) argue that the large numberof contigs that often result from de novo transcriptome assembly can greatly reduce the statistical power ofdifferential expression analysis. This results, in part, from the need to correct for testing the many addi-tional hypotheses that arise from considering the large number of assembled contigs (which is likely muchgreater than the actual number of transcripts expressed in the sample). Moreover, the sequence-similar con-tigs generated by de novo assemblers typically result in a high fraction of ambiguous, multi-mapping reads.This ambiguity is challenging to resolve, but must be accounted for when performing differential expressionanalysis at the contig level. Graph Construction

Raw Reads de novo A ssembly

Sailﬁsh expression estimation (via Quasi Mapping) r r t t t t r r { t , t } r { t , t } r { t , t , t } { t , t } { t , t , t } t t t t t t t t t t t t tximport Differential Expression (

DESeq2 ) MCLGraph ClusteringEquivalence Class Processing

RapClust

Fragment Equivalence ClassesContig Abundance EstimatesContig Clusters

Figure 1: An overview of the

RapClust pipeline. Fragment equivalence classes are computed using quasi-mapping , and these classes are used for both contig-level expression quantiﬁcation and generation of themapping ambiguity graph. The mapping ambiguity graph is partitioned using

MCL . The resulting clusterscan then be used for downstream analysis (e.g. differential expression).Common pipelines for studying differential gene expression across experimental conditions ﬁrst alignthe RNA-seq reads back to the assembled contigs. Then, they use transcript-level expression estimationtools, such as RSEM (10), to account for the high degree of multi-mapping ambiguity that results from thesubstantial sequence similarity between related contigs. To further improve expression estimates, contigsthat have high similarity (e.g. that are very sequence-similar or that have many overlapping reads aligningbetween them) are clustered together as putative transcripts or contigs of the same gene. Statistical methods,such as those discussed in (9, 23), are then used to identify contigs (or clusters of contigs) that are likelyto be differentially expressed across conditions. Clustering of contigs into putative genes can be a crucialstep in this analysis, since performing differential expression analysis at the level of clusters reduces themultiple hypothesis testing burden, which can be high in de novo transcriptomes, owing to the potentiallylarge number of assembled contigs. Additionally, aggregating contigs into such groups can improve the2obustness of expression estimation and hence the accuracy of differential expression analysis. We notethat, if one requires transcript-level differential expression, such a clustering procedure may not always beuseful. However, for many analyses, it is beneﬁcial.Although clustering may help improve the accuracy of differential expression results, it should be de-signed to account for the multiple sources of sequence similarity that appear in the assembly. For exam-ple, paralogs should ideally be placed in separate clusters, while isoforms of the same gene should be co-clustered. The effects of different clustering approaches in the context of analyzing de novo transcriptomeshas previously been explored in depth, e.g. by Davidson and Oshlack (5), which largely inspired the currentwork. Unfortunately, the approaches tend either to have high computational requirements (mainly due totheir need to align 10s or 100s of millions of reads back to the assembly), or can yield clusters that maypoorly reﬂect the true relationship between contigs and genes (11). In this paper, we present

RapClust , atool for clustering contigs in de novo transcriptome assemblies.

RapClust achieves comparable accuracy tostate-of-the-art transcriptome clustering methods, while being much faster.

RapClust works in conjunctionwith the

Sailfish tool, which is already capable of quickly and accurately producing contig-level abun-dance estimates. It uses the fragment equivalence classes computed by

Sailfish to derive accurate andbiologically meaningful clusters at only a marginal extra cost, beyond what is required for quantiﬁcation.An overview of the

RapClust pipeline is given in Figure 1.

RapClust requires only a transcriptomeassembly, the raw sequencing reads, and a description of the experimental design; it yields an accuratecontig clustering, along with both contig and cluster-level expression estimates.

RapClust is agnostic tothe choice of the underlying de novo assembler, and can be used with popular tools such as Trinity (7) orOases (21).

Quasi-mapping , a recently introduced (26) and fast alternative to read alignment, is used to mapthe reads to the assembled contigs. The multi-mapping structure of the sequencing reads with respect to thetranscriptome is encoded in the form of fragment equivalence classes , as discussed in Section 2.1. Theseequivalence classes induce a mapping ambiguity graph as detailed in Section 2.2 (the notion of the fragmentambiguity graph has previously proven useful e.g. in the context of re-estimating transcript abundances afterupdates to an annotation (19)). After some post-processing, the resulting graph is clustered using

MCL (31).The computed clusters represent a putative contig-to-gene mapping, which can then be used to aggregate thecontig-level expression estimates derived by

Sailfish . These cluster-level expression estimates can thenbe used for downstream analyses like differential expression testing.

In this work, we make explicit the connection between the successful approach of de novo transcriptomeclustering presented in (5), and the concept of equivalence classes over fragments that has enabled, in part,a new class of ultra-fast methods for transcript-level quantiﬁcation from RNA-seq data (3, 17, 18). Thenotion of fragment equivalence classes, as a means of factorizing the likelihood function used in transcript-level quantiﬁcation, was originally introduced by Nicolae et al. (14) and Turro et al. (30) (though a relatedfactorization was used somewhat earlier by Jiang and Wong (8)). Substantial speed improvements wereobtained when traditional alignment of fragments was replaced with much faster procedures like k-mercounting (17), lightweight-alignment (18), pseudoalignment (3) and quasi-mapping (26). All of the ultra-fast transcript quantiﬁcation tools mentioned above use these fast alternatives to alignment together withthe notion of fragment equivalence classes. Recently, Ntranos et al. (16) showed that fragment equivalenceclass counts (called transcript-compatibility counts therein) can be effectively used to cluster cells in single-cell RNA-seq experiments. In our current work, by drawing on these connections, and by focusing on therich information exposed by fragment equivalence classes, we frame the transcript clustering problem in thecontext of the mapping ambiguity graph induced by these equivalence classes. This allows for the rapidclustering of contigs, on the basis of both sequence and expression similarity, using only small intermediatespace. 3 .1 Computing Equivalence Classes via Quasi-mapping

The concept of quasi-mapping , which provides information about the transcripts, positions and orien-tations from which a fragment has possibly originated, but not the base-to-base alignment by which thefragment corresponds to the transcript, has recently been introduced in Srivastava et al. (26). There, it issuggested that quasi-mapping may be adopted as a much-faster alternative to fragment alignment when thebase-to-base alignments are not required for the task being performed. Srivastava et al. describe an efﬁ-cient implementation quasi-mapping in the tool

RapMap , and demonstrate how integrating quasi-mapping in the

Sailfish software for transcript-level quantiﬁcation led to considerable improvements in accuracyand speed. Here, we rely on quasi-mapping to allow for very fast and accurate determination of fragmentequivalence classes, which is crucial to the approach we adopt below. We note that, though quasi-mapping does not compute a nucleotide-level alignment, it is sensitive to even small differences in related referencesequences. Thus, it can accurately map fragments to e.g. the appropriate paralog, even if the fragmentcontains only a single SNP differentiating the alternative reference sequences. We refer the reader to (26)for details of the quasi-mapping concept and the particular algorithm used by

RapMap .In this paper, we rely heavily on the concept of fragment equivalence classes. We deﬁne an equivalencerelation over fragments, based on the set of transcripts to which they map. The set of fragments relatedunder this deﬁnition constitutes a fragment equivalence class. Let M ( f i ) be the set of transcripts to whichfragment f i maps, and let M ( f j ) be the set of transcripts to which fragment f j maps. We say that f i ∼ f j if and only if M ( f i ) = M ( f j ) . Consequently, a fragment equivalence class is a set of fragments such that,for every pair f i and f j in the class, f i ∼ f j . An equivalence class can be uniquely labeled based on the setof transcripts to which the fragments contained in this class map. We deﬁne the label of equivalence class [ f i ] = { f j ∈ F | f j ∼ f i } as lab ([ f i ]) . It is important to remember that, though the label consists of transcriptnames, the equivalence relation itself is deﬁned over sequenced fragments and not transcripts. Finally, inaddition to a label, we denote the count of each equivalence class C i by count ( C i ) ; this is simply the numberof equivalent fragments in C i . The fragment equivalence classes, as described above, are already computed internally by

Sailfish (17).We have modiﬁed

Sailfish to write these equivalence classes to a ﬁle once quantiﬁcation is complete (thisbehavior is enabled with the –dumpEq ﬂag). This yields, for each sample , a collection of equivalence classes,along with their associated labels and counts. To construct the complete mapping ambiguity graph of the experiment we need to aggregate these equivalence classes over all of the processed samples. In fact, thisaggregation is relatively simple since the labels of fragment equivalence classes are deterministic and stableacross samples (i.e. they depend only on the underlying transcriptome). We generate a single collection C of equivalence classes by merging the classes C , . . . , C M , from all samples, where M is the number ofsamples. Here, C contains the union of equivalence classes from C , . . . , C M , and classes that appear inmore than one sample of C , . . . , C M simply have their respective read count summed. The time and spacerequirement for this aggregation algorithm is linear in the size of input.For a given experiment, the collection C = { C , C , . . . , C k } of equivalence classes induces a weighted,undirected, mapping ambiguity graph G = ( V , E ) . Here, V = T , where T is the set of transcripts in theoriginal transcriptome — and E = {{ t i , t j } | ∃ C ‘ ∈ C where { t i , t j } ⊆ lab ( C ‘ ) } — that is t i and t j areconnected by an edge if and only if they both appear in the label of at least one equivalence class. For agiven edge { t i , t j } , its weight is given by w ( t i , t j ) = N ij / min ( N i , N j ) , where N i j = ∑ C ‘ ∈ C |{ t i , t j }⊆ lab ( C ‘ ) count ( C ‘ ) , N i = ∑ C ‘ ∈ C | t i ∈ lab ( C ‘ ) count ( C ‘ ) and N j = ∑ C ‘ ∈ C | t j ∈ lab ( C ‘ ) count ( C ‘ ) . While the samples we process from a de novo

RNA-seq experiment may contain 10s to 100s of millionsof fragments, the number of nodes in G is determined by the number of contigs. Further, the number of4dges is bounded by the complexity of the transcriptome (i.e. the degree of alternative splicing and paralogyin the underlying transcriptome), and, therefore, mostly independent of the number of fragments processed.For the transcriptomes and samples we analyze in this paper, the number of equivalence classes neverrises above a few hundred thousand, and is typically orders of magnitude smaller than the number of frag-ments (see table 1 for the number of equivalence classes in the different data sets). Thus, if the equivalenceclasses can be computed efﬁciently from the transcriptome and sequencing data, then the mapping ambiguitygraph can be constructed efﬁciently in terms of time and space. Once the mapping ambiguity graph G is constructed, it is ﬁltered, as described below, to yield a graph G . G is then clustered using an off-the-shelf graph clustering algorithm. Currently, RapClust employs twosimple ﬁlters. The ﬁrst ﬁlter removes nodes from G that have fewer than some nominal threshold of readsupport over all samples in an experiment. We adopt the cutoff used by (5), and remove any contig with 10or fewer mapped reads from the mapping ambiguity graph.The second ﬁlter, also adopted from (5), is used to remove edges between pairs of contigs that are likelyto arise from paralogous genes. Speciﬁcally, this ﬁlter tests the hypothesis that the constant of proportionalitybetween the number of reads mapping to t i and t j does not vary (by a statistically signiﬁcant amount)across conditions. This is done by testing the hypothesis ( H ) that the constant of proportionality remainsconstant across conditions versus the hypothesis ( H ) that it does not. The log-likelihoods of the competinghypotheses H and H are computed according to eqs. (1) and (2) respectively. A likelihood ratio test isperformed, and edges { t i , t j } from G where 2 ( ‘ − ‘ ) >

20 are removed (we refer the reader to (5) for ajustiﬁcation of the cutoff used in the likelihood ratio test). This test, of course, makes some simplifyingassumptions, since isoforms of the same gene might exhibit behavior consistent with H (e.g. if isoformswitching occurs between conditions). However, we found that this ﬁlter correctly separates transcriptsfrom paralogous genes more often than it incorrectly separates transcripts of the same gene. Thus, applyingthis ﬁlter leads to a slight increase in RapClust ’s precision and a typically smaller decrease in its recall. ‘ = ∑ c (cid:2)(cid:0) X ci · log (cid:0) r i j µ cj (cid:1)(cid:1) − (cid:0) r i j µ cj (cid:1)(cid:3) + (cid:2)(cid:0) X ci · log (cid:0) µ cj (cid:1)(cid:1) − (cid:0) µ cj (cid:1)(cid:3) (1) ‘ = ∑ c (cid:2)(cid:0) X ci · log (cid:0) r ci j X cj (cid:1)(cid:1) − (cid:0) r ci j X cj (cid:1)(cid:3) + (cid:2)(cid:0) X ci · log (cid:0) X cj (cid:1)(cid:1) − (cid:0) X cj (cid:1)(cid:3) (2)where r ci j = X ci X cj , r i j = ∑ c X ci ∑ c X cj , and µ cj = X ci + X cj + r i j , and X ci denotes the number of reads mapping to contig i under the j th condition (summed over allreplicates of a condition for simplicity).After applying both of these ﬁlters in sequence, we obtain the ﬁnal processed graph G , which is thenclustered using MCL (31). For the sake of simplicity, and to avoid an unnecessary dependence on parameters,we generated all the clusterings in this paper using

MCL ’s default parameters, and applying no additionalcutoff or modiﬁcation to the edge weights.

To analyze the performance of

RapClust , we have benchmarked its running time, space usage, andaccuracy against

Corset (5) and

CD-HIT (6, 11) (here, we consider

CD-HIT -EST ). Note that

CD-HIT doesnot provide any quantiﬁcation results. Therefore, in Section 3.3, we considered the clustering computed by

CD-HIT , but estimated the expression of those clusters by aggregating the contig-level expression estimates5omputed by

Sailfish . Testing was performed on 3 datasets, human primary lung ﬁbroblast samples, withand without a small interfering RNA (siRNA) knock down of HOXA1 (29) (Gene Expression Omnibus ac-cession GSE37704), yeast grown under batch and chemostat conditions (15) (Sequence Read Archive (SRA)accessions SRR453566 to SRR453571) and male and female chicken embryonic tissue (1) (SRA accessionSRA055442). For each of these data sets, we performed clustering of Trinity (7) de novo assemblies, whichwere obtained from (4).All experiments were performed on a 64-bit Linux server, running Ubuntu 14.04, with 4 hexacore IntelXeon E5-4607 v2 CPUs (with hyper-threading) running at 2.60GHz and 256GB of RAM. Wall-clock timewas recorded using the Unix time command. Table 1 gives a brief description of the input data.Table 1: Summary statistics for the transcriptomes and experimental samples on which the experimentswere carried out, as well as the average number of fragment equivalence classes and the size of the resultingmapping ambiguity graph. Yeast Human Chicken ,

389 335 , ∼ , , ∼ , , ∼ , , ,

535 222 , ,

481 2 , , Here, we report, for each clustering method, the time required to perform the clustering as well as thetotal size of the intermediate and result ﬁles written to disk. It is important to note that, unlike

RapClust and

Corset , CD-HIT neither counts reads nor performs quantiﬁcation. In order to use the clusters resultingfrom

CD-HIT in a typical analysis (e.g. quantiﬁcation and differential expression testing), one would needto either align reads to the

CD-HIT clusters, or perform quantiﬁcation on these clusters using the sequencedreads, both of which would add to the time and disk space required.For each transcriptome, the time reported for

Corset is the sum of the time required to trim the readsusing

Trimmomatic (2), align the reads using

Bowtie , and cluster the contigs using the resulting

BAM ﬁles(this adopts the protocol and parameters suggested in the

Corset documentation). For

RapClust , the timereported is the time required to run

Sailfish (v0.9.1) on all the samples, plus the time required to generate,Table 2:

RapClust is substantially faster than Corset , and requires only a small fraction of the intermediatedisk space used by

Corset . The majority of space savings for

RapClust come as a result of avoidingalignment and generation of the intermediate

BAM ﬁles. In addition to that, the percentage of reads mappedusing

RapClust is much higher. Originally, the percentage of mapped reads for

RapClust was even higherthan what is reported here, but we subsequently modiﬁed our default behavior to discard orphan mappingsto be consistent with the behavior of the alignments provided to

Corset .Yeast Human Chicken

RapClust Corset RapClust Corset RapClust Corset

Time(min) 5.12 37.25 22.67 211.67 64.18 453Space(Gb) 0.005 5.7 0.092 22 0.49 145% of reads 88.17 62.32 93.04 77.94 88.80 60.996able 3: The overall time taken for

Corset and

RapClust are dominated by the time taken for alignmentand quantiﬁcation respectively. However, when we consider just the time required for clustering (i.e. afteralignments have been generated for

Corset and after the fragment equivalence classes have been generatedfor

RapClust ), we observe that

RapClust and

CD-HIT are considerably faster than

Corset . As

Corset ’sclustering phase is single-threaded, we provide times for all methods in this table using only a single thread.(RC =

RapClust , CD =

CD-HIT , CT =

Corset )Yeast Human ChickenRC CD CT RC CD CT RC CD CTTime(min) 0.04 0.2 2.8 0.82 4.02 16.25 5.29 36.5 87ﬁlter and cluster the mapping ambiguity graph.

Sailfish is run with the –dumpEq option to write to diskthe equivalence classes computed during the quantiﬁcation of each sample.

Trimmomatic , Bowtie , and

Sailfish were each run with 4 threads. For

CD-HIT , only the time required to run

CD-HIT is reported.To determine the disk space required for analysis of each transcriptome using

Corset , we sum the size ofthe

BAM ﬁles produced by

Bowtie and the “count” and “cluster” ﬁles produced by

Corset . For

RapClust ,we determined the required intermediate disk space by summing the sizes of the

Sailfish quantiﬁcationdirectories and the “graph” and “cluster” ﬁles. Since both methods require the same input in terms ofthe assembled transcriptome and set of reads, we don’t count these toward the space requirements. Thecomplete timing results (from assembly and raw reads to computed clusters) for

RapClust and

Corset arepresented in Table 2. The times required for just the clustering steps of

RapClust and

Corset , as wellas the time required by

CD-HIT , are reported in Table 3. We choose to report these results separately tohighlight the fact that, since

CD-HIT only performs clustering, if one wishes to carry out expression analysison the clusters computed by

CD-HIT , she would additionally have to either align the sequenced fragmentsor perform contig-level abundance estimation on the samples, which could take much longer.

Precision Recall F1 S c o r e ( % ) .

20 63 .

98 46 . .

61 62 .

61 45 . .

24 13 .

22 21 . RapClustCorsetCD-HIT (a) Accuracy for yeast

Precision Recall F1 S c o r e ( % ) .

25 58 .

17 72 . .

73 55 .

89 70 . .

15 13 .

65 23 . (b) Accuracy for human Precision Recall F1 S c o r e ( % ) .

77 95 .

62 97 . .

26 91 .

13 95 . .

36 7 .

11 13 . (c) Accuracy for chicken Figure 2: The precision, recall, and F1-score of the

RapClust , Corset and

CD-HIT based clusterings, withrespect to ground-truth annotations, on the yeast (a), human (b) and chicken (c) data sets.We assessed the quality of the clusters obtained by the various tools using two different metrics. Theground truth labels for de novo assembled contigs were taken from (4), and the process of obtaining theselabels is described in (5). It is important to note that not all contigs can be labeled with an annotated gene,and unlabeled contigs were omitted when computing the metrics below.First, we considered the precision and recall metrics used by Davidson and Oshlack (5). Here, pairs ofcontigs are classiﬁed based on whether their cluster labels match their annotated gene labels. Speciﬁcally, a7able 4: The variation of information between contig to gene mapping using genome-based mapping ap-proach and the clusters generated using

RapClust , Corset and

CD-HIT . For each assembly, the clusteringproducing the lowest variation of information with respect to the true clustering is set in bold;

RapClust achieves the lowest VI on all assemblies.

RapClust Corset CD-HIT

Chicken = TP / TP + FP and Recall = TP / TP + FN . A higher precision signiﬁesthat, when contigs are co-clustered, they are more likely to have originated from the same underlying gene.A higher recall, on the other hand, suggests that more contigs originating from the same gene tend tobe co-clustered. Typically, precision and recall are competing objectives, as over clustering will improverecall but harm precision while under clustering will improve precision but harm recall. Commonly, theF1-Score = ( Precision · Recall / Precision + Recall ) is used as a single metric to summarize performance in terms ofboth the precision and recall. Figure 2 shows the accuracy of all three tools on the three assemblies interms of Precision, Recall and F1-Score. Corset and

RapClust generate similar clusters, with

RapClust generally yielding slightly higher recall than

Corset . CD-HIT , however, tends to do a much poorer job attrading off between these competing objectives. It provides similar (or, in case of the yeast data, higher)precision to

Corset and

RapClust , but the resulting clusters exhibit much lower recall.Second, we considered how similar the clusters obtained using the different methods are to the groundtruth clustering, which groups together all contigs labeled with the same gene. To assess this similarity, weused the variation of information (VI) (13). The VI is deﬁned over a pair of clusterings, and quantiﬁes theinformation lost and gained when moving from one clustering to the other. It allows one to measure howsimilar two clusterings are, regardless of the speciﬁc names or labels chosen for the clusters. The lower theVI between a pair of clusters, the more similar they are. To compute the VI between the true clustering C T and that obtained by a particular method C M , we discarded all contigs in C M that are not labeled with a genename, while retaining the clustering relations among the remaining contigs. If any contigs that correspondto annotated genes do not exist in C M (since they may be discarded, e.g., by the read count ﬁlter describedin section 2.3), we considered them to come from a single cluster, which is given a new label. We calledthe resulting clustering C M . Hence, C T and C M are deﬁned over the same set of contigs, and the similaritybetween them can be computed directly using the VI. The VI results are presented in Table 4. RapClust and

Corset seem to yield similar results, with

RapClust obtaining a slightly lower (better) VI. However, weobserved a marked difference between these two and

CD-HIT , whose clusters exhibit a substantially largerVI, especially on the human and chicken data.

We also tested the ability to recover gene-level differential expression using the clusterings produced bythe different methods.

De novo transcriptome assemblers have a tendency to produce many (often fractured)contigs. This tends to confound downstream differential expression analyses, due, in part, to the difﬁcultyof quantifying fractured or incorrectly assembled contigs, and due, in part, to the potentially large numberof extra statistical tests being performed (that must be corrected for). By estimating expression, and per-8orming differential expression testing at the cluster level, one might expect to simultaneously reduce themultiple hypothesis testing burden and “average out” some of the mistakes made in contig-level abundanceestimation.For gene-level differential expression analysis, we compared the genes called as differentially expressedunder each of the different clusterings versus the genes detected as differentially expressed under the truecontig-to-gene mapping (again, note that not all contigs are annotated with a gene label). We generated“ground truth” gene-level abundance estimates based on the contig-level abundance estimates computedby

Sailfish , and the true contig-to-gene mapping. Using the tximport (24) R package, we loaded the

Sailfish expression estimates, aggregated them to the gene level, and prepared them for use with DE-seq2 (12). A 2-condition differential expression test was performed in each data set; for human this was inbetween the conditions with and without the HOX1A knockdown, for yeast it was in the batch and chemo-stat growth conditions, and for chicken it was in the male and female samples (we collapsed the differenttissues within each sex). We then obtained corrected p-values for the hypothesis that each gene is differen-tially expressed across the conditions we considered. We considered genes having a corrected p-value lessthan or equal to 0 .

05 as differentially expressed. We estimated differential expression under the

RapClust and

CD-HIT clusterings in the same manner, where the quantiﬁcation estimates were held ﬁxed, but thetrue contig-to-gene mapping was replaced with the contig-to-cluster mapping produced by these tools. For

Corset , a count matrix is directly provided that was used as input to DESeq2.As a metric of comparison, we examined the rate at which true positive differentially expressed geneswere recovered versus the rate at which false positive differentially expressed genes were called. Eachpredicted cluster was labeled with the union of all of the genes with which its constituent contigs werelabeled. We sorted the list of clusters by p-value, and processed them in the following manner: when weencountered a cluster, we intersected its set of labeled genes with the set of truly differentially expressedgenes. Any genes that had not already been encountered in a more highly-ranked cluster were consideredas true positive predictions. Likewise, any genes that appeared in the cluster, but which did not occur inthe true set of differentially expressed genes were considered as false positives (if the genes had not alreadybeen encountered in a more highly-ranked cluster). We stopped processing the clusters once their adjustedp-value exceeded 0 .

05 (since, under common threshold, such clusters would likely not be considered to bedifferentially expressed). For the true and false positive predictions we encountered, the associated negativep-value of the associated cluster was used as the corresponding “score”. The ROC curves were generatedusing the CROC (28) Python package.

False Positive Rate T r ue P o s i t i v e R a t e RapClust (AUC = 0.89)CDHIT (AUC = 0.904)Corset (AUC = 0.863)

Yeast

False Positive Rate T r ue P o s i t i v e R a t e RapClust (AUC = 0.847)CDHIT (AUC = 0.846)Corset (AUC = 0.832) human

False Positive Rate T r ue P o s i t i v e R a t e RapClust (AUC = 0.534)CDHIT (AUC = 0.463)Corset (AUC = 0.14)

Chicken

Figure 3: ROC curves showing the recovery rate of

RapClust , Corset , and

CD-HIT ’s clusters in recoveringdifferentially expressed genes in each data set. Overall, poor precision or recall in terms of clustering may lead to detection of fewer truly differentially The rather poor performance of

Corset , under this particular test, the Chicken assembly seems not to be the result of a poorclustering, and is explored further in Table S1 and Figure S1.

RapClust recovers true positives versus false positivesis higher than that of

CD-HIT in 2 of the 3 assemblies, and is higher than that of

Corset in all the assemblies,as represented by the respective area under the curves. The beneﬁt of

RapClust is particularly apparent inthe chicken assembly. Since the quantiﬁcation results produced by

Sailfish tend to be reasonable, evenwhen the clustering is fairly poor, this may explain the relatively good performance of the

CD-HIT clusteringin yeast (relative to the rather poor quality of the clustering, as evaluated in Section 3.2).

We have presented a fast and accurate methodology for the data-driven clustering of de novo tran-scriptome assemblies. By making explicit the connection of the method of

Corset (5) and the fragmentequivalence classes that have recently been proven useful in the development of accurate and fast RNA-seq quantiﬁcation tools (3, 17, 18), we have demonstrated how state-of-the-art transcript clustering resultscan be obtained much more quickly than is possible with existing tools. Furthermore, when working di-rectly in the compact and efﬁcient representation of fragment equivalence classes, this clustering requiresonly a marginal computational and storage cost beyond what is already required for quasi-mapping -basedtranscript-level quantiﬁcation. We have implemented this approach in the open-source software

RapClust .There are many interesting directions for future work on this problem. Here, we mention only themost obvious. First, we believe that the quality of the resulting clusters could be improved through a data-driven selection of the appropriate cutoff parameters. Currently, we have directly adopted the parameterssuggested in (5), which are selected partly independent of the underlying data. Preliminary experimentshave suggested that one might be able to obtain substantially better clusters by selecting the transcriptcount cutoff, an edge-weight cutoff, and the log fold-change likelihood cutoff in a manner that is moredata-adaptive. However, automatically selecting these cutoffs in a general, yet rigorous fashion is a topicfor future work. Another potential improvement on the current methodology would be to adopt a morerobust log fold-change test, that may be more accurate in separating contigs that do not originate fromthe same gene.

Sailfish is capable of producing not only transcript-level abundances, but also estimatesof the variance of each predicted abundance via posterior Gibbs sampling or bootstraps. This varianceinformation can be incorporated into the estimates of log fold-change differences to allow for increasedprecision in separating potential paralogs. While the existing method works well in the completely de novo context (i.e. even when the genomes or transcriptomes of closely related organisms may not be available),integrating homology information, when available, has the potential to improve the clustering results (andprovide meaningful biological annotations for the clusters). The best way to integrate this information is anexciting direction for future work. Finally, we believe that sequence-level comparison and analysis of theresulting clusters of contigs could reveal important information about the nature of the transcripts presentin the samples. For example, one could imagine “reverse-engineering” the splicing patterns present in thetranscripts occurring in the same cluster. This would allow one to build a virtual gene model, even in thecompletely de novo context, which could then be used downstream, such as for differential splicing analyses.

References [1] Katie L Ayers, Nadia M Davidson, Diana Demiyah, Kelly N Roeszler, Frank Grützner, Andrew H Sinclair, Alicia Oshlack, and Craig A Smith. Rna sequencingreveals sexually dimorphic gene expression before gonadal differentiation in chicken and allows comprehensive annotation of the w-chromosome.

Genomebiology , 14(3):1–17, 2013.[2] Anthony M Bolger, Marc Lohse, and Bjoern Usadel. Trimmomatic: a ﬂexible trimmer for illumina sequence data.

Bioinformatics , page btu170, 2014.[3] Nicolas Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. Near-optimal rna-seq quantiﬁcation. arXiv preprint arXiv:1505.02710 , 2015.[4] Nadia Davidson and Alicia Oshlack. Data from ‘corset: enabling differential gene expression analysis for de novo assembled transcriptomes’, 2014. URL http://dx.doi.org/10.5281/zenodo.11134 .[5] Nadia M Davidson and Alicia Oshlack. Corset: enabling differential gene expression analysis for de novo assembled transcriptomes.

Genome Biol. , 15(7):410,2014.[6] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. Cd-hit: accelerated for clustering the next-generation sequencing data.

Bioinformatics , 28(23):3150–3152, 2012.

7] Brian J Haas, Alexie Papanicolaou, Moran Yassour, Manfred Grabherr, Philip D Blood, Joshua Bowden, Matthew Brian Couger, David Eccles, Bo Li, MatthiasLieber, et al. De novo transcript sequence reconstruction from rna-seq using the trinity platform for reference generation and analysis.

Nature protocols , 8(8):1494–1512, 2013.[8] Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in rna-seq.

Bioinformatics , 25(8):1026–1032, 2009.[9] Vanessa M Kvam, Peng Liu, and Yaqing Si. A comparison of statistical methods for detecting differentially expressed genes from rna-seq data.

American journalof botany , 99(2):248–256, 2012.[10] Bo Li and Colin N Dewey. Rsem: accurate transcript quantiﬁcation from rna-seq data with or without a reference genome.

BMC bioinformatics , 12(1):1, 2011.[11] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Bioinformatics , 22(13):1658–1659, 2006.[12] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2.

Genome biology , 15(12):1–21, 2014.[13] Marina Meil˘a. Comparing clusterings by the variation of information. In

Learning theory and kernel machines , pages 173–187. Springer, 2003.[14] Marius Nicolae, Serghei Mangul, Ion I M˘andoiu, and Alex Zelikovsky. Estimation of alternative splicing isoform frequencies from rna-seq data.

Algorithms forMolecular Biology , 6(1):1, 2011.[15] Intawat Nookaew, Marta Papini, Natapol Pornputtapong, Gionata Scalcinati, Linn Fagerberg, Matthias Uhlén, and Jens Nielsen. A comprehensive comparison ofrna-seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in saccharomyces cerevisiae.

Nucleic acids research , 40(20):10084–10097, 2012.[16] Vasilis Ntranos, Govinda M. Kamath, Jesse Zhang, Lior Pachter, and David N. Tse. Fast and accurate single-cell rna-seq analysis by clustering of transcript-compatibility counts. bioRxiv , 2016. doi: 10.1101/036863. URL http://biorxiv.org/content/early/2016/03/04/036863 .[17] Rob Patro, Stephen M Mount, and Carl Kingsford. Sailﬁsh enables alignment-free isoform quantiﬁcation from rna-seq reads using lightweight algorithms.

Naturebiotechnology , 32(5):462–464, 2014.[18] Rob Patro, Geet Duggal, and Carl Kingsford. Accurate, fast, and model-aware transcript expression quantiﬁcation with salmon. bioRxiv , page 021592, 2015.[19] Adam Roberts, Lorian Schaeffer, and Lior Pachter. Updating rna-seq analyses after re-annotation.

Bioinformatics , 29(13):1631–1637, 2013.[20] Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada,Jenny Q Qian, et al. De novo assembly and analysis of rna-seq data.

Nature methods , 7(11):909–912, 2010.[21] Marcel H Schulz, Daniel R Zerbino, Martin Vingron, and Ewan Birney. Oases: robust de novo rna-seq assembly across the dynamic range of expression levels.

Bioinformatics , 28(8):1086–1092, 2012.[22] Richard D Smith-Unna, Chris Boursnell, Rob Patro, Julian M Hibberd, and Steven Kelly. Transrate: reference free quality assessment of de-novo transcriptomeassemblies.

BioRxiv , page 021626, 2015.[23] Charlotte Soneson and Mauro Delorenzi. A comparison of methods for differential expression analysis of rna-seq data.

BMC bioinformatics , 14(1):1, 2013.[24] Charlotte Soneson, Michael I Love, and Mark D Robinson. Differential analyses for rna-seq: transcript-level estimates improve gene-level inferences.

F1000Research , 4, 2015.[25] Illiassou Hamidou Soumana, Christophe Klopp, Sophie Ravel, Ibouniyamine Nabihoudine, Bernadette Tchicaya, Hugues Parrinello, Luc Abate, Stéphanie Rialle,and Anne Geiger. Rna-seq de novo assembly reveals differential gene expression in glossina palpalis gambiensis infected with trypanosoma brucei gambiensevs. non-infected and self-cured ﬂies.

Frontiers in microbiology , 6, 2015.[26] Avi Srivastava, Hirak Sarkar, Nitish Gupta, and Rob Patro. Rapmap: A rapid, sensitive and accurate tool for mapping rna-seq reads to transcriptomes. bioRxiv ,page 029652, 2016.[27] Chris J. Stubben, Soﬁya N. Micheva-Viteva, Yulin Shou, Sarah K. Buddenborg, John M. Dunbar, and Elizabeth Hong-Geller. Differential expression of smallRNAs from Burkholderia thailandensis in response to varying environmental and stress conditions.

BMC Genomics , 15(1):1–18, 2014. ISSN 1471-2164.[28] S Joshua Swamidass, Chloé-Agathe Azencott, Kenny Daily, and Pierre Baldi. A croc stronger than roc: measuring, visualizing and optimizing early retrieval.

Bioinformatics , 26(10):1348–1356, 2010.[29] Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, and Lior Pachter. Differential analysis of gene regulation at transcriptresolution with rna-seq.

Nature biotechnology , 31(1):46–53, 2013.[30] Ernest Turro, Shu-Yi Su, Ângela Gonçalves, LJ Coin, Sylvia Richardson, Alex Lewin, et al. Haplotype and isoform speciﬁc expression estimation usingmulti-mapping rna-seq reads.

Genome Biol , 12(2):R13, 2011.[31] Stijn van Dongen.

Graph Clustering by Flow Simulation . PhD thesis, University of Utrecht, 2000. upplemental Materials: Accurate, Fast and Lightweight Clustering of de novo Transcriptomes using Fragment Equivalence Classes

Table S1: The number of contigs retained, and corresponding number of clusters generated, by each of themethods tested for each data set. The number of contigs retained by

RapClust , and the number of clustersit produces, generally reside between the corresponding values of

Corset and

CD-HIT .Chicken

CD-HIT Corset RapClust ,

450 181 ,

333 301 , ,

686 91 ,

653 204 , CD-HIT Corset RapClust ,

389 69 ,

107 79 , ,

115 43 ,

663 52 , CD-HIT Corset RapClust

False Positive Rate T r ue P o s i t i v e R a t e RapClust (AUC = 0.679)cdhit (AUC = 0.603)corset (AUC = 0.6)

Figure S1: ROC curves showing the recovery rate of

RapClust , Corset , and

CD-HIT ’s clusters in recov-ering differentially expressed genes in the Chicken dataset. Here, unlabeled contigs are removed from con-sideration in the predicted clusters, and the result labeled

Corset uses the clustering obtained by

Corset ,but couples them with the abundance estimates produced by

Sailfish . Under these testing conditions, allmethods demonstrate a more favorable recovery curve (as expected, since unlabeled contigs are removedfrom the clusters as in the “ground truth” expression estimates), but the accuracy of

Corset , in particular,shows the greatest improvement. This suggests that the clusters obtained by

Corset are accurate, and thatthe performance observed in Figure 3 is likely the result of