An observation of circular RNAs in bacterial RNA-seq data
Nicolas Innocenti, Hoang-Son Nguyen, Aymeric Fouquier d'hérouël, Erik Aurell
AAn observation of circular RNAs in bacterial RNA-seq data.
Nicolas Innocenti,
1, 2
Hoang-Son Nguyen, Aymeric Fouquier d’H´erou¨el, and Erik Aurell
1, 5 Department of Computational Biology, KTH Royal Institute of Technology,AlbaNova University Center, Roslagstullsbacken 17, SE-10691 Stockholm, Sweden Combient AB, Nettov¨agen 6, SE-17541, J¨arf¨alla, Sweden System Biology Programme, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden Luxembourg Centre for Systems Biomedicine, University of Luxembourg,7, Avenue des Hauts-Fourneaux, L-4362 Esch-sur-Alzette, Luxembourg Department of Information and Computer Science,Aalto University, Konemiehentie 2, FI-02150 Espoo, Finland
Circular RNAs (circRNAs) are a class of RNA with an important role in micro RNA(miRNA) regulation recently discovered in Human and various other eukaryotes as well asin archaea. Here, we have analyzed RNA-seq data obtained from
Enterococcus faecalis and
Escherichia coli in a way similar to previous studies performed on eukaryotes. We reportobservations of circRNAs in RNA-seq data that are reproducible across multiple experimentsperformed with different protocols or growth conditions.
Circular RNAs (circRNAs) are a type of RNAtranscripts the 3’ends of which are ligated totheir 5’ends during splicing. They were first dis-covered in plants [1] almost 40 years ago andrelatively little studied for several decades. Re-cently, an incidental discovery of a high abun-dance of circRNAs in human [2] and their rolein micro RNA (miRNA) regulation [3] attractedthe attention of many. Large scale searches fornovel circRNAs was performed in animals [4]and archaea [5] leading to hundreds of previ-ously unknown circularised transcripts. While ithas been shown that splicing in bacteria occurs,though much more rarely than in eukaryotes [6],that circRNAs can be synthesized by bacteria invivo using genetic engineering, and that thosetranscripts can even direct translations of pro-teins in those organisms, it is commonly believedthat all natural transcripts in bacteria are linear[7].Our starting point is a slightly modified ver-sion of the well established pipeline used byMemczak et al. [4] to analyse our recently pub-lished transcriptome data on
E. faecalis v583[8]. The major difference of our version of thepipeline is that we removed the requirement forbreak points to be flanked by GU/AG sites, assuch a configuration is characteristic of spliceoso-mal splicing, absent in bacteria. As a result, thealgorithm reports as potential circRNA all caseswhere the 3’ region of a read can be mapped up- stream of its 5’ region and where the sequenceon the reference genome near them correspondsto the one in the read insert (Figure 1(a) ).The discovery pipeline was applied to thetranscriptome coined ”IlluminaSt” [SRA acces-sion ID : SRR1770393] in our previous nomen-clature [8], leading to a total of 134 circRNA can-didates (Table S1). Based on these, we built acustom database of junctions encompassing 250nt on each side of the predicted junctions (Figure1(a)). This database was subsequently used asreference against which we aligned whole readsdirectly. After alignment, we counted the num-ber of reads that spanned the junction, and readsextending 8 nt or more on both sides of a junc-tion were termed ”long spanning reads” (Figure1(b)).Alignment of the same ”IlluminaSt” tran-scriptome reads yielded an enrichment of longspanning reads (Table S1) caused by relaxed con-straints since anchors in our 50 nt long readswere required to be 20 nt long: a junction neededto be in a region of 10 nt to be identified bythe detection algorithm, which, assuming uni-form RNA fragmentation, would happen in 20%of the cases. Few large deviations from the typi-cally observed two-to-ten-fold enrichment (TableS1) are plausibly explained by fragmentation bi-ases in the sequencing sample preparation [9].In addition, we mapped reads from fourother
E. faecalis
V583 transcriptomes [8] named a r X i v : . [ q - b i o . GN ] J un
5’ anchor 3’ anchorInsert Mapped 5’ anchorMapped 3’ anchor TailHead Junction250 nt 250 nt
Read from RNA-seqMapping on reference genomeNew reference sequence for re-alignment (a) (b)
FIG. 1: (a) In the discovery phase, reads are cut in silico into three parts : a region of 20 nt at the5’end (5’anchor), another of 20 nt at the 3’end (3’anchor) and the middle region containing the remainingnucleotides of the reads. For reads that contain a tail-to-head junction the insert, the discovery pipeline willmap the 5’anchor downstream of the 3’anchor and then use the insert sequence to recover the beginningand end of the RNA transcript before circularisation. Once this position has been found, a new referencesequence is created to be used for a second mapping procedure. (b) Schematic representation of reads afteraligning to the new reference built during the discovery phase. Reads that map entirely on one side ofthe junction are irrelevant to our purpose. Each read that maps across the junction votes in favor of itsexistence, with a higher confidence for reads having a longer sequences on each side. ”KTH”, ”KTHr”, ”Rt” and ”St” [SRA accessionIDs: SRR1766416, SRR1769261, SRR1769749,SRR1769750] acquired using different experi-mental protocols, growth conditions and se-quencing platforms. Out of the previously men-tioned 134 candidates, 81 (60%) were confirmedby at least two long spanning reads from at least two transcriptomes (Table S1).Removing cases with multiply mapped readfragments and retaining only candidates sup-ported by at least five non-duplicated reads, weextracted a list of 23 transcripts with lengthsbetween 56 and 549 nt for further analysis (Ta-ble S1). Among these, 15 (65%) were foundwith long spanning reads in at least two tran-scriptomes, and only one case (circ 000082 cor-responding to the sRNA ref114) was observedonly in ”IlluminaSt”. We then verified that,while antisense transcription was observed in theneighborhood of several candidates, in all butone case no long antisense spanning reads wereobserved, with the exception of the candidate”circ 000104”, discussed in more detail below.We retained circ 000104 because it was foundin all transcriptomes analyzed, had the high-est number of long-spanning reads in the KTHtranscriptome, and the antisense signal acrossits junction was about 20 times weaker thanthe direct signal. In all the 23 cases, coveragesignal clearly indicates presence of transcriptsacross the entire candidate sequences (See ”TheppRNomebrowse ”[12][8]).In order to estimate the false discovery rateof our procedure, we generated a database of5000 tail-to-head junctions up to 600 nt apartrandomly distributed over the genome with auniform distribution and assessed the number oflong spanning reads found covering those junc-tions. Indeed not a single one was reported whileover 95% the database had reads aligned in anirrelevant way near the junction, strengtheningthe confidence in our analysis of the true tran-scriptomic data.Of the 23 retrieved circRNA candidates, 11correspond to short RNAs (sRNAs) previouslydescribed, the majority of which having so faruncharacterised functions, 5 overlap with un-translated regions (UTRs) sometimes includ-ing part of a neighbouring coding region, 6roughly match the locations of annotated genes(including two candidates in the vicinity of
EF 0104/arcA ) and 1 coincides with a trypto-phan transfer RNA ( tRNA-Trp1 ) (TableS1).A striking observation is a set of 3 candi-dates (circ 000023, circ 000025 and circ 000104)in a region we recently reported as containingan antisense organisation of two or three ncR-NAs called Ref85A/B and Ref86 [8]. On the for-ward strand, we observes two tail-to-head junc-tions linking coordinates 1.894.992 to 1.895.401and 1.894.992/-3 to 1.895.183/-6, respectively.Position 1.894.992 coincides with the mappedtranscription start site (TSS) of Ref85A[8]. The other coordinates correspond to the 3’ of Ref85Band to a position 20 nt upstream of the 3’ ofRef86 on the opposite strand (Figure 2). On thereverse strand, the detected junction links posi-tion 1.895.168 to 1.895.273, i.e., 2 nt upstreamof the reported beginning of Ref86 to exactly itsreported end.While the role of such an intricate organiza-tion remains elusive, the above observation clar-ifies that Ref85A and Ref85B originate from thesame primary transcript and that Ref86 may in-deed interact with Ref85A/B.While it is well known that artifacts inthe reverse transcription, particularly templateswitching, may lead to junction reads similarto those resulting from splicing or circularisa-tion, those effects are expected to be random[10]. The presence of non-identical (long) span-ning reads across multiple experiments is an ar-gument against artifactual origin. Moreover, theclear tendency for the candidates to be found innon-protein coding regions (UTRs and ncRNAs)that constitute a minor portion of the genomestrengthens the biological origin of our observa-tions.To verify that the effect is not limited to theparticular bacterial strain used or to our exper-imental protocol, we then applied our discov-ery pipeline to 23 publicly available RNA-seqdatasets of
E. coli . Datasets were taken fromthe SRA database and chosen arbitrarily amongexperiments done with single-end strand-specificstandard RNA-seq on the Illumina sequencingplatform, on strain MG1655 or closely relatedones, and preferentially performed at high cov-erage (accession numbers for the datasets andresults of the analyses are given in table S2).From the 23 transcriptomes, we obtained atotal of about 20 000 predictions that cluster in2660 candidate loci (Table S2), indicating thatpredictions accross different samples lead to sim-ilar positions. Among those 2660 loci, 316 wereobserved in at least 7 transcriptomes (Table S2).To provide a global overview, we plotted thedensity of predictions across all 23 samples inFigure 2 and, in Figure S1, the coverage signalcalculated from the reads on which predictionsare based for the 316 regions. The two strongestpeaks in the plots correspond to csrB and csrC ,
20 nt 2 nt
Circ_000023 : 1894992 -> 1895401Circ_000025 : 1894992/-3 -> 1895183/-6Circ_000104 : 1895273 -> 1895168
FIG. 2: Three head-to-tail junctions detected in a region of the genome of
E. faecalis containing previouslyreported ncRNAs region, suggesting an intricate interaction between the RNA transcripts. two short non-coding RNAs with the same func-tion : binding multiple copies of the CsrA pro-tein, which is itself an inhibitor of translation,in order to inhibit its function and consequentlyincrease protein synthesis. Both have similarstructures, forming many hair-pins on an oth-erwise opened backbone [11]. Smaller peaks inFigure 2 correspond to hdeA and B , an operoncoding for two well known proteins in E. coli , yehD , a region annotated as putative protein,and rnpB, the catalytic subunit of RNase P.The latter is also found circularised in E. fae-calis as candidates circ 000068, and especiallycirc 000048 which is one the strongest signalsdiscussed above.To summarize, we discovered a series of po-tential circular RNA candidates in bacteria usingpublicly available discovery pipeline and RNA-seq data. We have verified that a number of thecandidates appear to be reproducible across ex-periments with different RNA-seq protocols andplatforms and the case of two or possibly threencRNAs antisense in
E. faecalis to each othershowing signs of circularisation in an intricatemanner.A result of our analysis is that most can-didates were not found in coding regions butrather in functional RNA transcripts or UTRs.In many cases, candidates found were repro-ducible across experiments with good accuracy.In particular the rnpB sRNA shows one of thestrongest signals in both
E. faecalis and
E. coli .We finally remark that in all samples, the signalfrom putative circRNA was weak compared to the total amount of RNA detected in the cor-responding regions, suggesting that either thecircularized form is rare byproduct of a post-transcriptional process or that the circular formis not well captured by the RNA-seq procedure,for instance by escaping fragmentation.
Authors’ Contributions
A.F.d’H. and N.I. formulated the researchproblem. N.I., A.F.d’H., and E.A. designed thework. H.-S.N. implemented the computationalpipeline. NI and H.-S.N. performed data analy-sis. N.I. and E.A. wrote the paper. All authorscritically reviewed the manuscript. [1] Sanger, H. L., Klotz, G., Riesner, D., Gross,H. J. & Kleinschmidt, A. K. Viroids aresingle-stranded covalently closed circular RNAmolecules existing as highly base-paired rod-likestructures.
Proc. Natl. Acad. Sci. USA ,3852–3856 (1976).[2] Salzman, J., Gawad, C., Wang, P. L., Lacayo,N. & Brown, P. O. Circular RNAs are the pre-dominant transcript isoform from hundreds ofhuman genes in diverse cell types. PLoS ONE , e30733 (2012).[3] Hansen, T. B. et al. Natural RNA circles func-tion as efficient microRNA sponges.
Nature ,384–388 (2013).[4] Memczak, S. et al.
Circular RNAs are a largeclass of animal RNAs with regulatory potency.
Nature , 333–338 (2013).
Coli k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k csrB E. Coli (MG1655) rnpB csrChdeA-B yehD
Density of predictionsRibosomal RNA loci
FIG. 3: Density of predicted circular RNAs in E. coli for the 23
E. coli transcriptomes analysed. The plotuses a linear scale and is normalised to the maximum value across the genome (the peak at csrB ).[5] Danan, M., Schwartz, S., Edelheit, S. & Sorek,R. Transcriptome-wide discovery of circularRNAs in archaea.
Nucleic Acids Res. (2011).[6] Woodson, S. A. Ironing out the kinks: splic-ing and translation in bacteria.
Genes Dev. ,1243–1247 (1998).[7] Perriman, R. & Ares, M. Circular mRNA candirect translation of extremely long repeating-sequence proteins in vivo. RNA , 1047–1054(1998).[8] Innocenti, N. et al. Whole genome mapping of5’ RNA ends in bacteria by tagged sequencing :A comprehensive view in
Enterococcus faecalis . RNA (2015).[9] Roberts, A., Trapnell, C., Donaghey, J., Rinn,J. L. & Pachter, L. Improving RNA-Seq expres-sion estimates by correcting for fragment bias.
Genome Biol. , R22 (2011).[10] Jeck, W. R. & Sharpless, N. E. Detecting andcharacterizing circular rnas. Nat. Biotech. ,453–461 (2014). [11] Gardner, P. P. et al. Rfam: updates to the rnafamilies database.
Nucleic Acids Res. , D136–D140 (2009).[12] http://ebio.u-psud.fr/eBIO_BDD.php C o li k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k c s r B E . C o l i ( M G ) r np B c s r C hd e A - B y e h D C o v e r a g e f r o m j un c t i o n r e a d s R i b o s o m a l R NA l o c i F i g u r e S : C o v e r ag e s i g n a l c a l c u l a t e d f r o m j un c t i o n r e a d s ( c o l u m n r e a d s i n t a b l e S ) f o r t h e r e g i o n s t h a t c o n t a i n p r e d i c t i o n s o f c i r c u l a r R NA s i n a t l e a s t s a m p l e s ..