Navigating in a sea of repeats in RNA-seq without drowning
Gustavo Sacomoto, Blerina Sinaimeri, Camille Marchet, Vincent Miele, Marie-France Sagot, Vincent Lacroix
NNavigating in a sea of repeats in RNA-seq withoutdrowning
Gustavo Sacomoto , , Blerina Sinaimeri , , Camille Marchet , , Vincent Miele ,Marie-France Sagot , , and Vincent Lacroix , INRIA Rhˆone-Alpes Universit´e Lyon 1, F-69000 Lyon; CNRS, UMR5558.
Abstract.
The main challenge in de novo assembly of NGS data is certainly todeal with repeats that are longer than the reads. This is particularly true for RNA-seq data, since coverage information cannot be used to flag repeated sequences, ofwhich transposable elements are one of the main examples. Most transcriptomeassemblers are based on de Bruijn graphs and have no clear and explicit modelfor repeats in RNA-seq data, relying instead on heuristics to deal with them. Theresults of this work are twofold. First, we introduce a formal model for repre-senting high copy number repeats in RNA-seq data and exploit its properties forinferring a combinatorial characteristic of repeat-associated subgraphs. We showthat the problem of identifying in a de Bruijn graph a subgraph with this charac-teristic is NP-complete. In a second step, we show that in the specific case of alocal assembly of alternative splicing (AS) events, we can implicitly avoid suchsubgraphs. In particular, we designed and implemented an algorithm to e ffi cientlyidentify AS events that are not included in repeated regions. Finally, we validateour results using synthetic data. We also give an indication of the usefulness ofour method on real data. Transcriptomes can now be studied through sequencing. However, in the absence of areference genome, de novo assembly remains a challenging task. The main di ffi cultycertainly comes from the fact that sequencing reads are short, and repeated sequenceswithin transcriptomes could be longer than the reads. This short read / long repeat issueis of course not specific to transcriptome sequencing. It is an old problem that hasbeen around since the first algorithms for genome assembly. In this latter case, theproblem is somehow easier because the coverage can be used to discriminate contigsthat correspond to repeats, e.g. using Myer’s A-statistics [6] or [7]. In transcriptomeassembly, this idea does not apply, since the coverage of a gene does not only reflect itscopy number in the genome, but also and mostly its expression level. Some genes arehighly expressed and therefore highly covered, while most genes are poorly expressedand therefore poorly covered.Initially, it was thought that repeats would not be a major issue, since they are mostlyin introns and intergenic regions. However, the truth is that many regions which arethought to be intergenic are transcribed [2] and introns are not always spliced out yetwhen mRNA is collected to be sequenced. Repeats are therefore very present in real a r X i v : . [ c s . D S ] J un amples, especially transposable elements, and cause major problems in transcriptomicassembly.Most, if not all current short-read transcriptome assemblers are based on de Bruijngraphs. Among the best known are O ases [12], T rinity [3], and to a lesser degree T rans -A byss [9] and IDBA- tran [8]. Common to all of them is the lack of a clear and explicitmodel for repeats in RNA-seq data. Heuristics are thus used to try and cope e ffi cientlywith repeats. For instance, in O ases short nodes are thought to correspond to repeatsand are therefore not used for assembling genes. They are added in a second step, whichhopefully causes genes sharing repeats not to be assembled together. In T rinity , thereis no attempt to deal with repeats explicitly. The first module of T rinity , Inchworm,will try and assemble the most covered contig which hopefully corresponds to the mostabundant alternative transcript. Then alternative exons are glued to this major transcriptto form a splicing graph. The last step is to enumerate all alternative transcripts. Ifrepeats are present, their high coverage may be interpreted as a highly expressed linkbetween two unrelated transcripts. Overall, assembled transcripts may be chimeric orspliced into many sub-transcripts.In the method we developed, K is S plice , which is a local transcriptome assem-bler [10], repeats may be less problematic, since the goal is not to assemble full-lengthtranscripts. Instead, K is S plice aims at finding variations in the transcriptome (SNPs, in-dels and alternative splicings). However, we previously reported that we were not ableto deal with large parts of the de Bruijn graph containing subgraphs associated to highlyrepeated sequences [10].Here, we try and achieve two goals: (i) give a clear formalization of the notion ofrepeats in RNA-seq data, and (ii) give a practical way to enumerate bubbles that arelost because of such repeats. Recall that we are in a de novo context, so we assumethat neither a reference genome / transcriptome nor a database of known repeats, e.g.R epeat M asker [13], is available.In particular, we formally introduce a model for representing the repeats and exploitits properties to infer a parameter characterizing repeated-associated subgraphs in a deBruijn graph. We prove its relevance but we also show that the problem of identifyingin a de Bruijn graph a subgraph corresponding to repeats using such a characteristicis NP-complete. Hence, a polynomial time algorithm for repeat identification that usessuch characterization is unlikely. Finally, we show that in the specific case of a localassembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs.More precisely, it is possible to find the structures ( i.e. bubbles) corresponding to ASevents in a de Bruijn graph that are not contained in a repeated-associated subgraph.Finally, using simulated RNA-seq data we show that the new algorithm can improve bya factor 2 the sensitivity of K is S plice , while also improving the precision. We also givean indication of the usefulness of our method in real data. Let Σ be an alphabet of fixed size σ . Here we always assume Σ = { A , C , T , G } . Given s ∈ Σ ∗ , let | s | denote its length, s [ i ] the i th element of s , and s [ i , j ] the substring s [ i ] s [ i + . . . s [ j ] for any i < j ≤ | s | . k-mer is a sequence s ∈ Σ k . Given an integer k and a set S of sequences eachof length n ≥ k , we define span ( S , k ) as the set of all distinct k -mers that appear as asubsequence in S . Definition 1.
Given a set of sequences (reads) R ⊆ Σ ∗ and an integer k, we define thedirected de Bruijn graph G k ( R ) = ( V , A ) where V = span ( R , k ) and A = span ( R , k + . Given a directed graph G = ( V , A ) and a vertex v ∈ V , we denote its out-neighborhood (resp. in-neighborhood ) by N + ( v ) = { u | ( v , u ) ∈ A } (resp. N − ( v ) = { u | ( u , v ) ∈ A } ), and itsout-degree (in-degree) by d + ( v ) = | N + ( v ) | ( d − ( v ) = | N − ( v ) | ). A (simple) path π = s (cid:123) t in G is a sequence of distinct vertices v , . . . , v l in which v = s , v l = t and for each0 ≤ i ≤ t , ( v i , v i + ) is an arc of G . If the graph is weighted, i.e. if there is a function w : A → Q ≥ associating a weight to every arc in the graph, then the length of a path π is the sum of the weights of the traversed arcs, and is denoted by | π | .An arc ( u , v ) ∈ A is called compressible if d + ( u ) = d − ( v ) =
1. The intuitionbehind this definition comes from the fact that every path passing through u should alsopass through v . It should therefore be possible to “compress” or contract this arc withoutlosing any information. Note that the compressed de Bruijn graph [3,12] commonlyused by transcriptomic assemblers is obtained from a de Bruijn graph by replacing, foreach compressible arc ( u , v ), the vertices u , v by a new vertex x , where N − ( x ) = N − ( u ), N + ( x ) = N + ( v ) and the label is the concatenation of the k -mers of u and v without theoverlapping part. See Fig. 1 for an example of a compressible arc in a de Bruijn graph. CTGACTTCT TGA GATGAG (a)
CTGAACTTCT GATGAG (b)
Fig. 1: (a) The arc (
CTG , TGA ) is the only compressible arc in the de Bruijn graph ( k = Given a de Bruijn graph G k ( R ) generated by a set of reads R for which we do not haveany information, the aim is to identify whether there are subgraphs of G k ( R ) that corre-spond each to a set of repeats in R . To this end, we try to identify and then exploit someof the topological properties of the subgraphs that are induced by repeats. Starting witha formal model for representing the repeats, we show that the number of compressiblearcs, which we denote by γ , is a relevant parameter for such a characterization. How-ever, we also prove that, for an arbitrary de Bruijn graph, identifying a subgraph G (cid:48) withbounded γ ( G (cid:48) ) is NP-complete. .1 Simple uniform model for repeats We now present the model we adopted for representing the repetition of a same se-quence in a genome or transcriptome. This model is a simple one and as such should beseen as only a first approximation of what may happen in reality. Moreover, it allowedus to infer one characteristic of repeated-associated subgraphs in a de Bruijn graph,namely that such subgraphs contain few compressible arcs. It is important to point outhowever that the model is realistic enough in most cases. In particular, it enables tomodel well recent invasions of transposable elements which often involve a large num-ber of similar copies. Such elements are among the most important sources of repeatsin the analysis of NGS data (DNA or RNA-seq). The number of compressible arcs isone of the criteria used to define so-called short nodes in O ases but no model or theo-retical justification was given [12]. This is one of the main contributions of this paper.In future, it would be important to consider more realistic models, and we mention onein the perspectives. Mathematically, one has to be aware however that such models willbe harder to analyze.The model is as follows. First, due to mutations, the sequences s , . . . , s m that repre-sent the repeats are not identical. However, provided that the number of such mutationsis not high (otherwise the concept of repeats would not apply), the repeats are consid-ered “similar” in the sense of pair-wisely presenting a small Hamming distance. Werecall that, given two equal length sequences s and s (cid:48) in Σ n , the Hamming distance between them, denoted by d H ( s , s (cid:48) ), is the number of positions i for which s [ i ] (cid:44) s (cid:48) [ i ].The model has then the following parameters: Σ , the length n of the repeat, thenumber m of copies of the repeat, an integer k (for the length of the k -mers considered),and the mutation rate, α , i.e. the probability that a mutation happens in a particularposition.We first choose uniformly at random a sequence s ∈ Σ n . At step i ≤ m , we createa sequence s i as follows: for each position j , s i [ j ] = s [ j ] with probability 1 − α ,whereas with probability α a value di ff erent from s [ j ] is chosen uniformly at randomfor s i [ j ]. We repeat the whole process m times and thus create a set S ( m , n , α ) of m suchsequences from s (see Fig.2 for a small example). The generated sequences thus havean expected Hamming distance of α n from s . c c c c c c c c c c A A C T G T A T C C s A C C T G T A G C C s G A C T C A A T C C s A A C T C T A T C C s A A C A G T A T C A s A A T T G T A G C C s A G C T G T A T C A s ... ... ... ... ... ... ... ... ... ... A A G T G A A T C C s Fig. 2:
An example of a set of repeats S (20 , , . .2 Topological characterization of the subgraphs generated by repeats Given a de Bruijn graph G k ( R ), if a is a compressible arc labeled by the sequence s = s . . . s k + , then by definition, a is the only outgoing arc of the vertex labeled bythe sequence s [1 , k ] and the only incoming arc of the vertex labeled by the sequence s [2 , k + k − s [2 , k ] will appear as a substring in R , always precededby the symbol s [1] and followed by the symbol s [ k + k − boundary rigid . It is not di ffi cult to see that the setof compressible arcs in a de Bruijn graph G k ( R ) stands in a one-to-one correspondencewith the set of boundary rigid ( k − R .We now calculate and compare among them the expected number of compressiblearcs in G = G k ( R ) when R corresponds to a set of repeats that are generated: (i) uni-formly at random, and (ii) according to our model. We show that γ is “small” in thecases where the induced graph corresponds to similar sequences, which provides evi-dence for the relevance of this parameter. Claim.
Let R be a set of m sequences randomly chosen from Σ n . Then the expectednumber of compressible arcs in G k ( R ) is Θ ( mn ). Proof.
The probability that a sequence of length k − n is (1 / k − . Thus the expected number of appear-ances of a sequence of length k − m randomly chosen sequences of length n is given by m ( n − k + / k − . If m ( n − k + ≤ k , then this value is upper boundedby 1, and all the sequences of length k − m ( n − k +
1) di ff erent k -mers. (cid:117)(cid:116) We consider now γ ( G k ( R )) for R = S ( m , n , α ). We upper bound the expected numberof compressible arcs by upper bounding the number of boundary rigid ( k − Theorem 1.
Given integers k , n , m with k < n and a real number ≤ α ≤ / , the deBruijn graph G k ( S ( m , n , α )) has o ( nm ) expected compressible arcs.Proof. Let s be a sequence chosen randomly from Σ n . Let S ( m , n , α ) = { s , . . . , s m } be the set of m repeats generated according to our model starting from s . Considernow the de Bruijn graph G = G k ( S ( m , n , α )). Recall that the number of compressiblearcs in this graph is equal to the number of boundary rigid ( k − S ( m , n , α ).Let X be a random variable representing the number of boundary rigid ( k − G . Consider the repeats in S ( m , n , α ) in a matrix-like ordering as in Fig.2 and observethat the mutations from one column to another are independent. Due to the symmetryand the linearity of expectation, E [ X ] is given by m ( n − k −
1) (the total number of( k − k − k − s = s [ i , i + k −
2] is boundary rigid clearlydepends on the distance from the starting sequence ˆ s = s [ i , i + k − d be thedistance d H ( ˆ s , ˆ s ).Observe that if the ( k − s [ i ] . . . s [ k −
1] is not boundary rigid then there existsa sequence y in S ( m , n , α ) such that y [ j ] = s [ j ] for all i ≤ j ≤ i + k − y [ i + k − (cid:44) s [ i + k −
1] or y [ i − (cid:44) s [ i − ffi cult to see that the probabilitythat this happens is lower bounded by (2 α − / α )(1 − α ) k − − d ( α/ d . Hence we have: r [ ˆ s is boundary rigid | d H ( ˆ s , ˆ s ) = d ] ≤ (cid:32) − (2 α − / α )(1 − α ) k − − d ( α/ d (cid:33) m − By approximating the above expression we therefore have that, E [ X ] ≤ ( n − k − m k − (cid:88) d = Pr [ ˆ s is boundary rigid | d H ( ˆ s , ˆ s ) = d ] (1) ≤ ( n − k − me − ( m − α − / α ) / ( α ) k − For a su ffi ciently large number of copies ( e.g. m = (cid:16) k α k (cid:17) ) and using the fact that (cid:16) k α k (cid:17) ≥ (1 /α ) α k , we have that E [ X ] is o ( mn ). This concludes the proof. (cid:117)(cid:116) The previous result shows that the number of compressible arcs is a good parameterfor characterizing a repeat-associated subgraph.
As we showed, a subgraph due to repeated elements has a distinctive feature: namelyit contains few compressible arcs. Based on this, a natural formulation to the repeatidentification problem in RNA-seq data is to search for large enough subgraphs that donot contain many compressible arcs. This is formally stated in Problem 1. In order todisregard trivial solutions, it is necessary to require a large enough connected subgraph,otherwise any set of disconnected vertices or any small subgraph would be a solution.Unfortunately, we show that this problem is NP-complete, so an e ffi cient algorithm forthe repeat identification problem based on this formulation is unlikely. Problem 1 (Repeat Subgraph).INSTANCE:
A directed graph G and two positive integers m , t . DECIDE:
If there exists a (connected) subgraph G (cid:48) = ( V (cid:48) , E (cid:48) ), with | V (cid:48) | ≥ m andhaving at most t compressible arcs.In Theorem 2, we prove that this problem is NP-complete for all directed graphswith the (total) degree, i.e. the sum of in and out-degree, bounded by 3. The reductionis from the Steiner tree problem which requires finding a minimum weight subgraphspanning a given subset of vertices. It remains NP-hard even when all arc weights are1 or 2 (see [1]), this version is denoted by STEINER(1 , G = ( V , E ) with arc weights in { , } , a set of terminal vertices N ⊆ V and an integer B , it is NP-complete to decide if there exists a subgraphof G spanning N with weight at most B , i.e. a connected subgraph of G containing allvertices of N .We specify next a family of directed graphs that we use in the reduction. Givenan integer x we define the directed graph R ( x ) as a cycle on 2 x vertices numbered ina clockwise order and where the arcs have alternating directions, i.e. for any i ≤ x ,( v i , v i + ) is an arc. Note that in R ( x ) all vertices in even positions, i.e. v i have out-degree 2 and in-degree 0 and those v i + out-degree 0 and in-degree 2. Clearly, none ofthe arcs of R ( x ) is compressible. heorem 2. The
Repeat Subgraph Problem is NP-complete even for directed graphswith degree bounded by d, for any d ≥ .Proof. Given a complete graph G = ( V , E ), a set of terminal vertices N and an upperbound B , i.e. an instance of STEINER(1 , RepeatSubgraph Problem with a graph G (cid:48) with degree bounded by 3. Let us build the graph G (cid:48) = ( V (cid:48) , E (cid:48) ). For each vertex v in V \ N , add a corresponding subgraph r ( v ) = R ( | V | ) in G (cid:48) and for each vertex v in N , add a corresponding subgraph r ( v ) = R ( | E | + | V | +
1) in G (cid:48) . For each arc ( u , v ) in E with weight w ∈ { , } , add a simple directed path composedby w compressible arcs connecting r ( u ) to r ( v ) in G (cid:48) , the subgraphs corresponding to u and v . The first vertex of the path should be in a sink of r ( u ) and the last vertex ina source of r ( v ). By construction there are at least | V | vertices with in-degree 2 andout-degree 0 (sink) and | V | vertices with out-degree 2 and in-degree 0 (source) in both r ( v ) and r ( u ). It is clear that G (cid:48) has degree bounded by 3. Moreover, the size of G (cid:48) ispolynomial in the size of G and it can also be polynomially constructed.In this way, the graph G (cid:48) has one subgraph for each vertex of G and a path with oneor two (depending on the weight of the corresponding arc) compressible arcs for eacharc of G . Thus, there exists a subgraph spanning N in G with weight at most B if andonly if there exists a subgraph in G (cid:48) with at least m = | N | + | E || N | + | V | | N | verticesand at most t = | B | compressible arcs. This follows from the fact that any subgraph of G (cid:48) with at least m vertices necessarily contains all the subgraphs r ( v ), where v ∈ N ,since the number of vertices in all r ( v ), with v ∈ V \ N , is at most | E | + | V | and theonly compressible arcs of G (cid:48) are in the paths corresponding to the arcs of G . (cid:117)(cid:116) We can obtain the same result for the specific case of de Bruijn graphs. The reduc-tion is very similar but uses a di ff erent graph family. Theorem 3.
The
Repeat Subgraph Problem is NP-complete even for subgraphs of deBruijn graphs on | Σ | = symbols. In the previous section, we showed that an e ffi cient algorithm to directly identify thesubgraphs of a de Bruijn graph corresponding to repeated elements is unlikely to existsince the problem is NP-complete. However, in this section we show that in the specificcase of a local assembly of alternative splicing (AS) events, we can implicitly avoidsuch subgraphs. More precisely, it is possible to find the structures ( i.e. bubbles) corre-sponding to AS events in a de Bruijn graph that are not contained in a repeat associatedsubgraph, thus answering to the main open question of [10].K is S plice [10] is a method for de novo calling of AS events through the enumera-tion of so-called bubbles , that correspond to pairs of vertex-disjoint paths in a de Bruijngraph. The bubble enumeration algorithm proposed in [10] was later improved in [11].However, even the improved algorithm is not able to enumerate all bubbles correspond-ing to AS events in a de Bruijn graph. There are certain complex regions in the graph,likely containing repeat-associated subgraphs but also real AS events [10], where bothalgorithms take a huge amount of time. See Fig. 3 for an example of a complex regionig. 3: An alternative splicing event in the SCN5A gene (human) trapped inside a complex re-gion, likely containing repeat-associated subgraphs, in a de Bruijn graph. The alternative isoformscorrespond to a pair of paths shown in red and blue. with a bubble corresponding to an AS event. The enumeration is therefore halted aftera given timeout. The bubbles drowned (or trapped) inside these regions are thus missedby K is S plice . Here, we impose an extra restriction to the bubbles, reflecting the fact thatthey are in complex regions but not in a repeat-associated subgraph. In this way, we canenumerate bubbles in complex regions implicitly avoiding repeat-associated subgraphs. Definition 2 ( ( s , t , α , α , b ) -bubbles). Given a directed graph G = ( V , E ) and two ver-tices s , t ∈ V, an ( s , t , α , α , b ) -bubbles is a pair of vertex-disjoint st-paths π , π withlengths bounded by α , α , each containing at most b branching vertices. This extends ( s , t , α , α )-bubbles (Def. 1 in [11]) by adding the extra condition thateach path should have at most b branching vertices. The number of branching verticesis proportional to the number of incompressible arcs. Intuitively, if a bubble containsmany incompressible arcs, it is likely contained in a repeat-associated subgraph. Hence,by limiting the number of branching vertices (incompressible arcs) we avoid the verticesfrom repeat-associated subgraphs. Indeed, in Section 4.2 we show that by consideringbubbles with at most b branching vertices in K is S plice , we increase both sensitivity andprecision. This supports our claim that by focusing on ( s , t , α , α , b )-bubbles, we avoidrepeat-associated subgraphs and recover at least part of the bubbles trapped in complexregions. In this section, we modify the algorithm of [11] to enumerate all bubbles with at most b branching vertices in each path. Given a weighted directed graph G = ( V , E ) and avertex s ∈ V , let B s ( G ) denote the set of ( s , ∗ , α , α , b )-bubbles of G . The algorithm re-cursively partitions the solution space B s ( G ) at every call until the considered subspaceis a singleton (contains only one solution) and in that case it outputs the correspondingsolution. In order to avoid unnecessary recursive calls, it maintains the invariant that thecurrent partition contains at least one solution. It proceeds as follows. Invariant:
At a generic recursive step on vertices u , u (initially, u = u = s ), let π = s (cid:123) u , π = s (cid:123) u be the paths discovered so far (initially, π , π are empty).et G (cid:48) be the current graph (initially, G (cid:48) : = G ). More precisely, G (cid:48) is defined as follows:remove from G all the vertices in π and π but u and u . Moreover, we also maintainthe following invariant ( ∗ ): there exists at least one pair of paths ¯ π and ¯ π in G (cid:48) thatextends π and π so that π · ¯ π and π · ¯ π belongs to B s ( G ). Base case:
When u = u = u , output the ( s , u , α , α , b )-bubble given by π and π . Recursive rule:
Let B s ( π , π , G (cid:48) ) denote the set of ( s , ∗ , α , α , b )-bubbles to be listedby the current recursive call, i.e. the subset of B s ( G ) with prefixes π , π . Then, it is theunion of the following disjoint sets . – The bubbles of B s ( π , π , G (cid:48) ) that use e , for each arc e = ( u , v ) out-going from u ,that is B s ( π · e , π , G (cid:48) − u ), where G (cid:48) − u is the subgraph of G (cid:48) after the removalof u and all its incident arcs. – The bubbles that do not use any arc from u , that is B s ( π , π , G (cid:48)(cid:48) ), where G (cid:48)(cid:48) is thesubgraph of G (cid:48) after the removal of all arcs out-going from u .In order to maintain the invariant ( ∗ ), we only perform the recursive calls when B s ( π · e , π , G (cid:48) − u ) or B s ( π , π , G (cid:48)(cid:48) ) are non-empty. In both cases, we have to decideif there exists a pair of (internally) vertex-disjoint paths ¯ π = u (cid:123) t and ¯ π = u (cid:123) t , such that | ¯ π | ≤ α (cid:48) , | ¯ π | ≤ α (cid:48) , and ¯ π , ¯ π have at most b , b branching vertices,respectively. Since both the length and the number of branching vertices are monotonicproperties, i.e. the length and the number of branching vertices of a path prefix is smallerthan this number for the full path, we can drop the vertex-disjoint condition. Indeed, let¯ π and ¯ π be a pair of paths satisfying all conditions but the vertex-disjointness one.The prefixes ¯ π = u (cid:123) t ∗ and ¯ π = u (cid:123) t ∗ , where t ∗ is the first intersection ofthe paths, satisfy all conditions and are internally vertex-disjoint. Moreover, using adynamic programming algorithm, we can obtain the following result. Lemma 1.
Given a non-negatively weighted directed graph G = ( V , E ) and a sources ∈ V, we can compute all shortest paths from s using at most b branching vertices inO ( b | V || E | ) time. As a corollary, we can decide if B s ( π , π , G ) is non-empty in O ( b | V || E | ). Now, usingan argument similar to [11], i.e. leaves of the recursion tree and solutions are in one-to-one correspondence and the height of the recursion tree is bounded by 2 n , we obtainthe following theorem. Theorem 4.
The ( s , ∗ , α , α , b ) -bubbles can be enumerated in O ( b | V | | E ||B s ( G ) | ) time.Moreover, the time elapsed between the output of any two consecutive solutions (delay)is O ( b | V | | E | ) . To evaluate the performance of our method, we simulated RNA-seq data using theF lux S imulator version 1.2.1 [4]. We generated 100 million reads of 75 bp with thedefault error model provided by the F lux S imulator . We used the RefSeq annotated The same holds for u instead of u . uman transcriptome (hg19 coordinates) as a reference and we performed a two-steppipeline to obtain a mixture of mRNA and pre-mRNA (i.e. with introns not yet spliced).To achieve this, we first ran the F lux S imulator with default settings with the Refseq an-notations. Then we modified the annotations to include the introns and re-ran F lux S im - ulator on this modified version. In this second run, we additionally constrained theexpression values of the pre-mRNAs to be correlated to the expression values of theircorresponding mRNAs, as simulated in the first run. Finally, we mixed the two setsof reads to obtain a total of 100M reads. We tested two values: 5% and 15% for theproportion of reads stemming from pre-mRNAs. Those values were chosen so as tocorrespond to realistic ones as observed in a cytoplasmic mRNA extraction (5%) and atotal (cytoplasmic + nuclear) mRNA extraction [14].On this simulated dataset, we ran K is S plice [10] version 2.1.0 (K s O ld ) and 2.2.0(K s N ew , with a maximum number of branching nodes set to 5) and obtained lists ofdetected bubbles that are putative alternative splicing (AS) events.In order to assess the precision and the sensitivity of our method, we comparedour set of found AS events to the set of true
AS events. Following the definition ofA stalavista , an AS event is composed of two sets of transcripts, the inclusion / exclusionisoforms respectively. An AS event is said to be true if at least one transcript among theinclusion isoforms and one among the exclusion isoforms is present in the simulateddataset with at least one read. We outline that this definition is very permissive andincludes AS events with very infrequent transcripts.To compare the results of K is S plice with the true AS events, we propose that a trueAS event is found (counted as a true positive (TP)) if there is a bubble with one pathcorresponding to the inclusion set and the other to the exclusion set. If not, the event iscounted as a false negative (FN). In the meantime, if a bubble does not correspond toany true AS event, it is counted as a false positive (FP). To align the paths of the bubblesto transcript sequences, we used the B lat aligner [5] with 95% identity and a constraintof 95% of each bubble path length to be aligned (to account for the sequencing errorssimulated by F lux S imulator ). We computed the sensitivity TP / (TP + FN) and precisionTP / (TP + FP) for each simulation case and we report their values for various classes ofexpression of the minor isoform. Expression values are measured in reads per kilobase(RPK).The plots for the sensitivity of each version on the two simulated datasets are shownin Fig. 4. On the one hand, both versions of K is S plice have similar sensitivity in the5% pre-mRNA dataset, with K s N ew performing slightly better, especially for highlyexpressed variants. On the other hand, the sensitivity of the new version is considerablybetter over all classes of expression levels in the 15% pre-mRNA dataset. In this case,the precisions for K s N ew and K s O ld are 21% and 41%, respectively. This represents animprovement of almost 100% over the old version. The results reflect the fact that themost problematic repeats are in intronic regions. A small unspliced mRNA rate leadsto few repeat-associated subgraphs, so there are not many AS events drowned in them(which are then missed by K s O ld ). In this case, the advantage of using K s N ew is lessobvious, whereas a large proportion of pre-mRNA leads to more AS events drowned inrepeat-associated subgraphs which are identified by K s N ew and missed by K s O ld .learly, any improvement in the sensitivity is meaningless if there is also a signifi-cant decrease in precision. This is not the case here. In both datasets, K s N ew improves the precision of K s O ld . It increases from 95% to 98% and from 90% to 99%, in the 5%and 15% datasets, respectively. Moreover, both running times and memory consump-tion are very similar for both versions.In order to give an indication of the usefulness of our repeat-avoiding bubble enu-meration algorithm with real data, we also ran K s N ew and K s O ld in the SK-N-SHHuman neuroblastoma cell line RNA-seq dataset (wgEncodeEH000169, total RNA). InFig. 5 we have an example of a non-annotated exon skipping event not found by K s O ld .Observe that the intronic region contains several transposable elements (many of whichare Alu sequences), while the exons contain none. This is a good example of a bubble(exon skipping event) drowned in a complex region of the de Bruijn graph. The bubble(composed by the two alternative paths) itself contains no repeated elements, but it issurrounded by them. In other words, this is a bubble with few branching vertices thatis surrounded by repeat-associated subgraphs. Since K s O ld is unable to di ff erentiatebetween repeat-associated subgraphs and the bubble, it spends a prohibitive amount oftime in the repeat-associated subgraph and fails to find the bubble. −2 −1 0 1 2 3 . . . . . .
5% pre−mRNA
RPK (log scale) S en s i t i v i t y KsNewKsOld −1 0 1 2 3 . . . . . .
15% pre−mRNA
RPK (log scale) S en s i t i v i t y KsNewKsOld
Fig. 4:
Sensitivity of K s N ew and K s O ld for several classes of expression of the minor isoform.Each class contains the same number of AS events. Although transcriptome assemblers are now commonly used, their way to handle re-peats is not satisfactory, arguably because the presence of repeats in transcriptomeshas been underestimated so far. Given that most RNA-seq datasets correspond to totalmRNA extractions, many introns are still present in the data and their repeat contentig. 5:
One of the bubbles found only by K s N ew with the corresponding sequences mapped to thereference human genome and visualized using the UCSC Genome Browser. The first two linescorrespond to the sequences of, respectively, the shortest (exon exclusion variant) and longestpaths of the bubble mapped to the genome. The blue line is the Refseq annotation. The last lineshows the annotated SINE and LINE sequences (transposable elements). cannot be simply ignored. In this paper, we propose a simple model for repeats. Clearlythis model could be improved, for instance by using a tree-like structure to take intoaccount the evolutionary nature of repeat (sub)families. Variability in the sizes of thecopies of a repeat family would also enable to model more realistically the true natureof families of transposable elements (the type of repeats which cause most trouble inassembly). Certainly, a mathematical analysis of a more realistic model would be moredi ffi cult to obtain. On the other hand, our simple model captures an important quali-tative characteristic of repeat-associated subgraphs: the presence of few compressiblearcs. This characterization allows us to design an e ffi cient algorithm to identify bub-bles corresponding to AS events implicitly avoiding repeat-associated subgraphs. Thisapproach improves both the sensitivity and the precision of K is S plice . References
1. M. Bern and P. Plassmann. The steiner problem with edge lengths 1 and 2.
InformationProcessing Letters , 1989.2. S. Djebali, C.A. Davis, A. Merkel, and A. Dobin et al.
Landscape of transcription in humancells.
Nature , (7414):101–108, 2012.3. M. G. Grabherr, B. J. Haas, M. Yassour, and J.Z. et al.
Levin. Full-length transcriptomeassembly from RNA-Seq data without a reference genome.
Nature Biot. , 2011.4. T. Griebel, B. Zacher, P. Ribeca, and E. Raineri et al.
Modelling and simulating genericRNA-Seq experiments with the flux simulator.
Nucleic Acids Res. , 2012.5. W. J. Kent. BLAT–the BLAST-like alignment tool.
Genome Research , 12(4):656–664, 2002.6. E. W. Myers, G. G. Sutton, A. L. Delcher, and I.M. Dew et al.
A whole-genome assemblyof drosophila.
Science , 287(5461):2196–204, March 2000.7. P. Nov´ak, P. Neumann, and J. Macas. Graph-based clustering and characterization of repeti-tive sequences in next-generation sequencing data.
BMC Bioinformatics , 2010.8. Y. Peng, H. C. M. Leung, S.-M. Yiu, and M.-J. Lv et al.
IDBA-tran: a more robust de novode bruijn graph assembler for transcriptomes with uneven expression levels.
Bioinformatics ,29(13):326–334, 2013.9. G. Robertson, J. Schein, R. Chiu, and R. Corbett et al.
De novo assembly and analysis ofRNA-seq data.
Nature methods , 7(11):909–912, 2010.10. G. Sacomoto, J. Kielbassa, R. Chikhi, and R. Uricaru et al.
KISSPLICE: de-novo callingalternative splicing events from RNA-seq data.
BMC Bioinformatics , 13(Suppl 6):S5, 2012.1. G. Sacomoto, V. Lacroix, and M.-F. Sagot. A polynomial delay algorithm for the enumera-tion of bubbles with length constraints in directed graphs and its application to the detectionof alternative splicing in rna-seq data. In
WABI , pages 99–111, 2013.12. M. H. Schulz, D. R. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo RNA-seqassembly across the dynamic range of expression levels.
Bioinformatics , 2012.13. A. F. A. Smit, R. Hubley, and P. Green. RepeatMasker Open-3.0, 1996-2004.14. H. Tilgner, D. G. Knowles, R. Johnson, and C. A. Davis et al.