A Pipeline for Insertion Sequence Detection and Study for Bacterial Genome

Abstract

Insertion Sequences (ISs) are small DNA segments that are able to insert themselves into genomes. These mobile genetic elements (MGEs) seem to play an essential role in genome rearrangements and in the evolution of prokaryotic genomes, but the tools that discover ISs efficiently and accurately are still few and not fully precise. Two main factors strongly affect IS discovery, namely gene annotation and functionality prediction. Indeed, some specific genes called "transposases" encode the enzymes responsible for catalyzing such transposition, but there is currently no fully accurate method to decide whether a given predicted gene is a real transposase or not. This is why the authors of this article aim at designing a novel pipeline for IS detection and classification, which embeds the most recent tools developed in this field of research, namely OASIS (Optimized Annotation System for Insertion Sequences) and the ISFinder database (an up-to-date and accurate repository of known insertion sequences). As IS detection with these tools depends on predicted coding sequences, the proposed pipeline also encompasses various bacterial gene annotation tools (namely Prokka, BASys, and Prodigal). A complete IS detection and classification pipeline is then proposed and tested on a set of 23 complete genomes of Pseudomonas aeruginosa. This pipeline can also be used to investigate the performance of annotation tools, which has led us to conclude that Prodigal is the best software for IS prediction. An in-depth study of IS elements in P. aeruginosa has then been conducted, leading to the conclusion that close genomes inside this species also have close numbers of IS families and groups.


Huda Al-Nayyef, Christophe Guyeux, and Jacques M. Bahi
FEMTO-ST Institute, UMR 6174 CNRS, DISC Computer Science Department, Université de Franche-Comté, 16, Rue de Gray, 25000 Besançon, France
Computer Science Department, University of Mustansiriyah, Iraq
{huda.al-nayyef, christophe.guyeux, jacques.bahi}@univ-fcomte.fr
June 27, 2017


The number of completely sequenced bacterial and archaeal genomes is rising steadily. This increase makes it possible to develop novel kinds of large-scale approaches to understand genome structure and evolution over time. Gene content prediction and genome comparison have both provided new important information and deciphering keys to understand the evolution of prokaryotes [14]. Sequences that are important for understanding the rearrangement of genomes during evolution are the so-called transposable elements (TEs), which are DNA fragments or segments that have the ability to insert themselves into new chromosomal locations, and often make duplicate copies of themselves during the transposition process [5]. Remark that, in the bacterial kingdom, only the cut-and-paste mechanism of transposition can be found, the transposable elements involved in such a move being the insertion sequences (ISs).

Insertion sequences range in size from 600 to more than 3000 bp. They are divided into 26 main families in prokaryotes, as described in ISFinder [12], an international reference database for bacterial and archaeal ISs that includes background information on transposons. The main function of ISFinder is to assign IS names and to provide a focal point for a coherent nomenclature for all discovered insertion sequences. This database includes more than 3500 bacterial ISs [6,17]. Data come from the detection of repeated patterns, which can easily be found using homology-based techniques [3]. The classification of families, for its part, depends on transposase homology and overall genetic organization. Indeed, most ISs consist of short inverted repeat sequences that flank one or more open reading frames (ORFs, see Figure 1), whose products are the transposase proteins necessary for the transposition process. The main problem with such approaches for IS detection and classification is that they are obviously highly dependent on the annotations, and existing tools only use the NCBI ones, whose quality is limited and very variable.

Figure 1: IS element types [17]

In this research work, the authors' intention is to find an accurate method for discovering insertion sequences in prokaryotic genomes. To achieve this goal, we propose to use one of the most recent computational tools for automated annotation of insertion sequences, namely OASIS, together with the international database of all known IS sequences (ISFinder). More precisely, OASIS works with GenBank files in which gene functionality is fully described: this tool identifies ISs in each genome by finding conserved regions surrounding already-annotated transposases. Such a technique makes it possible to discover new insertion sequences, even if they are not in the ISFinder database. A novel pipeline that removes the dependence on NCBI annotations, and that works with any annotation tool (with or without description of gene functionality), is then proposed. The output of our pipeline contains all detected IS sequences together with other important information such as inverted repeat (IR) sequences, lengths, positions, names of family and group, and other details that help in studying IS structures.

The contributions of this article can be summarized as follows. (1) A pipeline for insertion sequence discovery and classification is proposed, which does not depend on NCBI annotations. It uses unannotated genomes and embeds various annotation tools specific to Bacteria (such as Prokka, BASys, and Prodigal) in its process. (2) The overlapping and consensus problems that naturally appear after merging the annotation methods recalled above are solved, in order to obtain a large and accurate number of ISs with their names of families and groups. And finally (3) the pipeline is tested on a set of 23 complete genomes of Pseudomonas aeruginosa, and biological consequences are outlined.

The remainder of this article is organized as follows. In Section 2, various tools for discovering IS elements in different species of Bacteria and Archaea are presented. The suggested methodology for increasing both the number and the accuracy of detected IS elements is explained in Section 3. The pipeline is detailed in Section 4, while an application example using 23 complete genomes of P. aeruginosa is provided in Section 5. This article ends with a conclusion section, in which the contributions are summarized and intended future work is detailed.

The study of the plant-pathogenic prokaryote Xanthomonas oryzae pv. oryzae (Xoo), which causes bacterial blight (one of the most important diseases of rice), was published in 2005 by Ochiai et al. [8]. They used GeneHacker [16], GenomeGambler version 1.51, and the Glimmer program [2] for coding sequence prediction. Insertion sequences were finally classified by a BLAST analysis using the ISFinder database mentioned previously.

IScan, developed by Wagner et al. [15], was then proposed in 2007. Inverted repeats are found using Smith-Waterman local alignments on transposase references found with BLAST and used as a local database. This tool has been applied to 438 completely sequenced bacterial genomes by using BLAST with referenced transposases, to determine which transposases are related to insertion sequences. Touchon and Rocha, for their part, analyzed 262 different bacterial and archaeal genomes downloaded from GenBank (NCBI) in 2007 [13]. A coding sequence was then considered as an IS element if its BLASTP best hit in the ISFinder database had an e-value lower than 10^−….

ISA was created by Zhou et al. in 2008 [17]. This annotation program depends on both NCBI annotations and ISFinder. More precisely, the authors manually collected 1,356 IS elements with both sequences and terminal signals from the ISFinder database, which have been used as templates for the identification of all IS elements and for map construction in the targeted genomes. ISA, which is not publicly available, has finally been used for an analysis of 19 cyanobacterial and 31 archaeal annotated genomes downloaded from NCBI.

In 2010, Plague analyzed the neighboring gene orientations (NGOs) of all ISs in 326 fully sequenced bacterial chromosomes [9]. Primary annotations were obtained from the Comprehensive Microbial Resource database (release 1.0-20.0) at the Institute for Genomic Research (http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi). The approach for extracting IS elements from these genomes was to consider that a coding sequence with a best BLASTX hit e-value lower than 10^−… is an insertion sequence. ISsaga, for its part, was developed in 2011 by Varani et al. [14]. They used eight different bacterial genomes downloaded from NCBI, and produced a web application pipeline that allows semi-automated annotation based on BLAST against the ISFinder database. However, ISsaga cannot automatically identify new insertion sequences that are not already present in the ISFinder database.

A new computational tool for automated annotation of ISs was then released in 2012 by Robinson et al. [10]. This tool is called OASIS, which stands for "Optimized Annotation System for Insertion Sequences". They worked with 1,737 bacterial and archaeal genomes downloaded from NCBI. OASIS identifies ISs in each genome by finding conserved regions surrounding already-annotated transposase genes. OASIS uses a maximum likelihood algorithm to determine the edges of multicopy ISs based on conservation between their surrounding regions. For defining inverted repeats, the same strategy as in IScan is used (Smith-Waterman alignment). The authors also used hierarchical agglomerative clustering to identify groups of IS lengths. The IS set is then classified according to family and group after a BLASTP best hit in the ISFinder database with an e-value lower than 10^−…. When a cluster cannot be matched with any entry of the database, the IS set is considered as new. Thus OASIS has the ability to discover new insertion sequences, that is, sequences which cannot be found in ISFinder.

Finally, in 2014, the analysis of the NGOs for all IS elements within 155 fully sequenced archaeal genomes was presented by Florek et al. [4]. To do so, they launched a BLASTP against ISFinder, with an e-value less than or equal to 10^−…, for all protein coding sequences downloaded from NCBI which are related to IS elements.

Two major concerns with the state of the art detailed above can be emphasized. Firstly, most of these tools cannot detect new insertion sequences. Secondly, all of them are based on NCBI annotations, whose quality is very variable. The only exception is ISsaga, which could work with other annotation tools (but it depends only on transposase ORFs that have already been defined in ISFinder). Our objective in the next section is to propose a pipeline that solves these two issues, being able to deal with unannotated genomes and to detect unknown ISs.

For illustration purposes, the proposed pipeline for IS element prediction will be presented using 23 complete genomes of P. aeruginosa available on the NCBI website, from the RefSeq and INSDC/GenBank databases, see Table 1 (RefSeq genomes were preferred when available). The prediction of IS elements in the proposed pipeline depends on both OASIS [10] and ISFinder [12].

Table 1: P. aeruginosa genomes used (columns: Index, Genome name, INSDC (GenBank) GID, RefSeq GID, Input genomes GID, Accession no.)

OASIS is used in this pipeline for predicting insertion sequences in prokaryotic genomes. It detects ISs in each genome by finding conserved regions surrounding already-annotated transposase genes, which are identified by the word transposase in the "product" field of the GenBank file. Obviously, OASIS highly depends on the quality of annotations [10], while determining whether a given gene is a transposase or not is a very difficult task (indeed, transposases are among the most abundant and ubiquitous genes in nature [1], and they are widely scattered in prokaryotic genomes). OASIS takes GenBank-format files as input and produces two output files for each provided genome. The first one is a FASTA file that contains all IS nucleotide sequences, with start and end positions, together with the amino acid sequence of each ORF. The second file is a summary table providing attributes that describe each insertion sequence: set-id, family, group, IS positions, inverted repeat left (IRL) and right (IRR), and orientation. Remark that most of this information is in the ISFinder database too: OASIS finds it on its own, but extracts family and group names from ISFinder.

The main problem found in OASIS is solved in the proposed pipeline by using different types of annotations: NCBI will not be used alone, and the gene functionality taken from annotation tools will or will not be used depending on the situation. Finally, transposases within ISs will be verified using the ISFinder database. OASIS can thus be used in two different ways in our pipeline, depending on the provided GenBank file. These two modules have been named NOASIS, which uses the original input GenBank genome file provided by the NCBI (as it is, without any modification), and DOASIS, which deals with modified GenBank files that have been updated to obtain more accurate results than NOASIS. These modules are described hereafter.
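To make the seeding rule concrete, the following minimal Python sketch scans a GenBank feature table for CDS entries whose /product qualifier contains the word transposase. It is not OASIS code: the function name `find_transposase_cds` and the toy record are hypothetical, and the parsing is deliberately simplistic (it assumes each qualifier fits on one line).

```python
import re


def find_transposase_cds(genbank_text):
    """Return (location, product) pairs for CDS features labeled as transposase.

    Simplified parser: assumes every qualifier fits on a single line.
    """
    hits = []
    location = None  # location of the CDS currently being read, if any
    for line in genbank_text.splitlines():
        # Feature keys are indented by exactly five spaces in the feature table.
        feature = re.match(r"^ {5}(\S+)\s+(\S+)", line)
        if feature:
            location = feature.group(2) if feature.group(1) == "CDS" else None
            continue
        product = re.match(r'^\s+/product="(.*)"', line)
        if product and location is not None:
            if "transposase" in product.group(1).lower():
                hits.append((location, product.group(1)))
    return hits


# Toy feature table (hypothetical locus tags and coordinates).
TOY = '''\
     gene            complement(100..900)
                     /locus_tag="TOY_0001"
     CDS             complement(100..900)
                     /locus_tag="TOY_0001"
                     /product="transposase"
     CDS             1200..2000
                     /locus_tag="TOY_0002"
                     /product="hypothetical protein"
'''

print(find_transposase_cds(TOY))
# [('complement(100..900)', 'transposase')]
```

A genome whose feature table contains no such product label yields an empty list, which is the situation in which a transposase-seeded search has nothing to start from.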

To find predicted ISs in the NOASIS module, we simply applied OASIS to the input set of genomes with their NCBI annotations, that is, with the original downloaded GenBank files. Using the reference genome named PAO1, the summary output by the pipeline is given in Tables 2 and 3. In these NOASIS tables, the summary produced by OASIS is enriched with the new features described below:

• Real IS: IS sequences whose best match (first hit) when using BLASTN against the ISFinder database has an e-value equal to 0.0, and in which each ORF within the IS is recognized as a transposase.
• Partial IS: sequences that match part of a known IS from ISFinder (i.e., have an e-value lower than 10^−…) and whose ORFs also have a transposase gene functionality.
• Putative New IS: sequences with a bad score after a BLASTN against ISFinder, but with a transposase. They may be real insertion sequences not yet added to the ISFinder database, or false positives, and require human curation.

Applying this slightly improved version of OASIS to the 23 genomes of Pseudomonas leads to a major issue: surprisingly, NOASIS found no real insertion sequences in some genomes like PACS2 or SCV20265. The problem is that OASIS finds multiple copies of IS elements in each genome by identifying conserved regions surrounding transposase genes. However, some of the considered genomes either have no information about transposase genes in their GenBank feature tables, or simply have no feature table in their GenBank files. This issue is at the basis of our improved module called DOASIS, which is explained below. For the sake of comparison, Figure 2 contains similar results for Mycobacterium tuberculosis.

Figure 2: IS elements detected in 28 Mycobacterium tuberculosis genomes.

3.3 Developed OASIS (DOASIS)

The main idea of the DOASIS module is that the information about transposases within GenBank files is potentially incorrect (i.e., may all be false positives). So we simply decide to remove all transposase words in the product fields of all input genomes. We then update this information as follows.

Step 1: GenBank update. Input GenBank files are modified following one of the three methods below.

1. All-Tpase: we consider that every gene may potentially be a transposase, so all product fields are set to "transposase".
2. Zigzag Odd: we suggest that genes in odd positions are putative transposases and we update the GenBank file accordingly. Oddly, this new path produces new candidates which are not detected with All-Tpase.
3. Zigzag Even: similar to Zigzag Odd, but on even positions.

We also checked a randomized method (i.e., putting "transposase" on randomly picked genes). However, it found fewer predicted real ISs or new real ISs than the three methods presented above. For this reason, we will not further investigate the randomized method.

Figure 3: Comparison of predicted ISs between the randomization method and the all/odd/even methods.

Step 2. We apply OASIS three times (i.e., once per method) to all genomes, and then take the output FASTA file that contains both the nucleotide and amino acid sequences of each IS element.

Step 3. A BLASTN against ISFinder is applied to each IS sequence. If the e-value of the first hit is 0.0, then the ORF within this IS belongs to a known (Real) IS already existing in the ISFinder database. Otherwise, if the e-value is lower than 10^−…, then we have found a Partial IS.

Step 4. Collect all Real ISs from the three previous methods (All-Tpase, Zigzag Odd, and Zigzag Even) and then remove overlaps among them. Finally, produce the best Real IS set with all its information. Remark that the problem of finding consensus and overlaps can be treated as a lexical parsing problem.
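The GenBank update of Step 1 can be illustrated with a minimal Python sketch operating on an ordered list of genes rather than on a real GenBank file. The function name `relabel_products`, the dictionary layout, the 1-based position counting, and the neutral replacement label "hypothetical protein" are all assumptions made for this illustration; they are not taken from the pipeline's source code.

```python
def relabel_products(genes, method):
    """Return a copy of `genes` with product fields rewritten for one DOASIS run.

    genes  : list of dicts with at least a "product" key, in genome order
    method : "all" (All-Tpase), "odd" (Zigzag Odd), or "even" (Zigzag Even)
    Positions are counted from 1 (an assumption of this sketch).
    """
    if method not in ("all", "odd", "even"):
        raise ValueError(f"unknown method: {method}")
    updated = []
    for position, gene in enumerate(genes, start=1):
        gene = dict(gene)  # do not modify the caller's data
        # Existing transposase labels are distrusted and removed first;
        # the neutral replacement label is an assumption of this sketch.
        if "transposase" in gene["product"].lower():
            gene["product"] = "hypothetical protein"
        selected = (
            method == "all"
            or (method == "odd" and position % 2 == 1)
            or (method == "even" and position % 2 == 0)
        )
        if selected:
            gene["product"] = "transposase"
        updated.append(gene)
    return updated


# Toy gene list (hypothetical locus tags and products).
genes = [
    {"locus_tag": "g1", "product": "transposase"},
    {"locus_tag": "g2", "product": "kinase"},
    {"locus_tag": "g3", "product": "permease"},
]
for method in ("all", "odd", "even"):
    print(method, [g["product"] for g in relabel_products(genes, method)])
# all ['transposase', 'transposase', 'transposase']
# odd ['transposase', 'kinase', 'transposase']
# even ['hypothetical protein', 'transposase', 'permease']
```

Each of the three relabeled gene lists would then be written back to a GenBank file and passed to OASIS, one run per method, as in Step 2.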

It is now possible to describe the proposed pipeline, which can use the two modules detailed in the previous section. This pipeline, depicted in Figure 4, increases the number of Real ISs detected in the set of P. aeruginosa genomes under consideration (indeed, the detection is improved in all categories of insertion sequences, but we only focus on Real ISs in the remainder of this article, for the sake of concision). Its steps are detailed in what follows.

Figure 4: The proposed pipeline

Step 1: ORF identification. Our pipeline is currently compatible with any type of annotation tool, whether it predicts gene functionality or not, but for comparison we only focus in this article on the following tools: BASys, Prokka, and Prodigal. BASys (Bacterial Annotation System) is a web server that performs automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences. It uses more than 30 programs to determine nearly 60 annotation subfields for each gene. Remark that genomes must be submitted online manually, and that some curation stage may be required to remove DNA ambiguities from the returned GenBank files. Prokka (rapid prokaryotic genome annotation), for its part, is a classical command-line software for fully annotating draft bacterial genomes, producing standards-compliant output files for further analysis [11]. Finally, Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is an accurate bacterial and archaeal gene-finding software provided by the Oak Ridge National Laboratory [7].

Step 2: IS prediction. The second stage of the pipeline consists in using either NOASIS or DOASIS for predicting IS elements. Notice that NOASIS cannot be used with Prodigal, as this module requires information about gene functionality (both NOASIS and DOASIS can be used with Prokka and BASys annotations).

Step 3: IS validation. This step is carried out by launching BLASTN on each predicted IS sequence against ISFinder. The e-value of the first hit is then checked: if it is 0.0, then the ORF within this sequence is a Real IS known by ISFinder. As described previously, it will be considered as a Partial IS if its e-value is lower than 10^−…. The IS family and group names are returned too.

Figure 5: Comparison between Prokka, BASys, and NCBI functionality annotations

We can first remark in Figure 5 that using either Prokka or BASys for gene detection and functionality prediction is better than directly taking the annotated genomes from NCBI: a larger number of Real ISs can be found. Additionally, this comparison shows that Prokka outperforms BASys in 3 families of ISs (namely IS3, IS30, and ISNCY), while BASys seems better at detecting insertion sequences belonging to the IS5, IS1182, and Tn3 families. This variability may be explained by the fact that the functionality annotations of these tools probably depend on the IS families that were known when these tools were released.

Table 2: Summary table produced by NOASIS (beginning)

Name  Genome       Start    End      Orientation  SetID  ISFinder name  Family  Group   Length
PAO1  NC_002516.2  499832   501193   -            1      ISPa11         IS110   IS1111  1361
PAO1  NC_002516.2  2556875  2558236  +            1      ISPa11         IS110   IS1111  1361
PAO1  NC_002516.2  3043478  3044839  -            1      ISPa11         IS110   IS1111  1361
PAO1  NC_002516.2  3842002  3843363  -            1      ISPa11         IS110   IS1111  1361
PAO1  NC_002516.2  4473550  4474911  +            1      ISPa11         IS110   IS1111  1361
PAO1  NC_002516.2  5382524  5383885  -            1      ISPa11         IS110   IS1111  1361
PAO1  NC_002516.2  54041    54835    +            2      ISStma5        IS3     IS3     794

Table 3: Summary table produced by NOASIS (end)

IRR=IRL                           Locus tag (gbk)  Product (gbk)      E Value  IS type
ATGGACTCCTCCC                     [['PA0445']]     [['transposase']]  0.0      Real IS
ATGGACTCCTCCC                     [['PA2319']]     [['transposase']]  0.0      Real IS
ATGGACTCCTCCC                     [['PA2690']]     [['transposase']]  0.0      Real IS
ATGGACTCCTCCC                     [['PA3434']]     [['transposase']]  0.0      Real IS
ATGGACTCCTCCC                     [['PA3993']]     [['transposase']]  0.0      Real IS
ATGGACTCCTCCC                     [['PA4797']]     [['transposase']]  0.0      Real IS
AAAGGGGACAGATTTATTTTCCCTGCTCTAAT  [['PA0041a']]    [['transposase']]  0.23     Putative New IS
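The "IS type" column of Table 3 follows the e-value rules stated for Real, Partial, and Putative New ISs. A minimal Python sketch of that decision rule is given below, assuming the candidate already contains an ORF labeled as transposase. The function name `classify_is` is hypothetical, and the Partial IS threshold is left as a caller-supplied parameter: the value used in the example is an arbitrary placeholder, not the threshold of the article.

```python
def classify_is(first_hit_evalue, partial_threshold):
    """Label a transposase-bearing candidate from its first BLASTN hit in ISFinder.

    first_hit_evalue  : e-value of the first hit, or None when there is no hit
    partial_threshold : upper bound on the e-value of a Partial IS (caller-supplied)
    """
    if first_hit_evalue is None:
        return "Putative New IS"
    if first_hit_evalue == 0.0:
        return "Real IS"
    if first_hit_evalue < partial_threshold:
        return "Partial IS"
    return "Putative New IS"


# Arbitrary placeholder for illustration only, NOT the article's threshold.
PLACEHOLDER_THRESHOLD = 1e-10

for evalue in (0.0, 1e-40, 0.23, None):
    print(evalue, "->", classify_is(evalue, PLACEHOLDER_THRESHOLD))
# 0.0 -> Real IS
# 1e-40 -> Partial IS
# 0.23 -> Putative New IS
# None -> Putative New IS
```

With these rules, the six ISPa11 copies of Table 3 (e-value 0.0) are labeled Real IS and the last row (e-value 0.23) is labeled Putative New IS for any placeholder threshold below 0.23.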

The effects of the DOASIS module compared to plain OASIS on annotated NCBI genomes are depicted in Figure 6. The improvement in real IS discovery is obvious, illustrating the low quality and inadequacy of NCBI annotations for studying insertion sequences in bacterial genomes, and the improvement brought by our pipeline. This chart also shows that a zigzag path in the annotation can, oddly, improve the detection of insertion sequences.

The prediction of real ISs is based on finding conserved regions (i.e., inverted repeats (IRs)) surrounding transposase genes. Some ISs have been lost in All-Tpase, for the following reason: when we suggested that all genes are transposases, OASIS found predicted ISs that consist of large sets of transposases flanked by IRs at their left and right boundaries. But when these predicted ISs were checked against the ISFinder database, we did not find any good match. Conversely, with the Zigzag methods, good matches (real ISs) have been found, because many of these elements consist of one or two transposase genes flanked by IRs. These results are listed in detail in Table 4 for the BASys annotation tool.

Figure 6: NOASIS (NCBI annotation) versus DOASIS

Table 4: BASys annotation using NOASIS and DOASIS

                        BASys Normal  All Transpos  Zigzag Odd  Zigzag Even  Best Real
Name        Genome      Real IS       Real IS       Real IS     Real IS      (All T/odd/even)
PACS2       106896550   2             1             2           0            2
PAO1        110645304   9             6             0           0            6
UCBPP-PA14  116048575   3             8             8           1            8
PA7         152983466   13            0             0           0            0
LESB58      218888746   2             3             5           2            6
M18         386056071   1             2             2           1            2
NCGM2.S1    386062973   15            0             12          0            12
DK2         392981410   8             9             10          8            11
B136-33     478476202   3             5             3           3            5
19BR        485462089   5             0             0           10           10
213BR       485462091   5             4             4           4            4
RP73        514407635   4             5             5           2            5
c7447m      543873856   9             0             9           9            10
PAO581      543879514   9             6             8           0            8
PAO1-VE2    553886202   8             6             9           6            9
PAO1-VE13   553895034   8             6             8           8            8
PA1R        558665962   4             4             4           5            5
PA1         558672313   5             5             5           6            6
MTB-1       564949884   0             1             0           1            1
LES431      568151185   5             14            13          8            14
SCV20265    568306739   5             14            13          8            14
PA38182     575870901   1             3             1           2            4
YL84        576902775   7             7             7           7            7
Total                   131           109           128         91           157
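The "Best Real" column combines the Real ISs of the three DOASIS runs after removing overlaps (Step 4 of the DOASIS module), which is why it can exceed each individual column (for LESB58: 3, 5, and 2 versus 6). The following Python sketch shows one possible way to do this on genome coordinates. The function name `merge_real_is`, the toy coordinates, and the tie-breaking rule (keep the leftmost element, and the longest one among those sharing a start) are assumptions of this sketch; the article does not specify how the retained element is chosen.

```python
def merge_real_is(runs):
    """Union of Real IS intervals from several runs, with overlaps removed.

    runs : dict mapping a method name to a list of (start, end) tuples,
           1-based inclusive coordinates on the same genome.
    """
    # Identical intervals found by several runs collapse into one candidate.
    candidates = sorted(
        {interval for intervals in runs.values() for interval in intervals},
        key=lambda iv: (iv[0], -(iv[1] - iv[0])),  # leftmost first, then longest
    )
    kept = []
    for start, end in candidates:
        if kept and start <= kept[-1][1]:
            continue  # overlaps the previously kept element: drop it
        kept.append((start, end))
    return kept


# Toy coordinates, not taken from any genome of the study.
runs = {
    "all_tpase": [(100, 900)],
    "zigzag_odd": [(100, 900), (5000, 6300)],
    "zigzag_even": [(5050, 6300), (9000, 9800)],
}
print(merge_real_is(runs))
# [(100, 900), (5000, 6300), (9000, 9800)]
```

In this toy case, the individual runs report 1, 2, and 2 elements while the merged set contains 3, the same kind of relation as between the method columns and the Best Real column.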

We can thus wonder whether a wrong prediction of real ISs is due to a wrong coding sequence prediction or to functionality errors. Switching between NOASIS and DOASIS allows us to answer this question. We can conclude from Table 5 that (1) annotation errors are more frequent in NCBI annotations, while Prokka annotates well the sequences related to ISs (see the NOASIS columns), and that (2) both NCBI and Prokka have a better coding sequence prediction than BASys, at least when considering sequences involved in IS elements (see the DOASIS columns and the correlation line). More precisely, the correlation is computed between the numbers of real IS elements predicted by NOASIS and by DOASIS.

Table 5: Correlation table for different annotation tools

                   NCBI             BASys            Prokka
                   NOASIS  DOASIS   NOASIS  DOASIS   NOASIS  DOASIS
Number of Real IS  110     169      131     157      169     176
Correlations
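As an illustration of how such a correlation can be obtained from per-genome counts, the Python sketch below computes a correlation coefficient between the NOASIS and DOASIS counts of BASys, using the "BASys Normal" and "Best Real" columns of Table 4. The article does not state which coefficient is used; Pearson's r is assumed here, and the helper name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Per-genome Real IS counts from Table 4, in row order (PACS2 ... YL84).
basys_noasis = [2, 9, 3, 13, 2, 1, 15, 8, 3, 5, 5, 4, 9, 9, 8, 8, 4, 5, 0, 5, 5, 1, 7]
basys_doasis = [2, 6, 8, 0, 6, 2, 12, 11, 5, 10, 4, 5, 10, 8, 9, 8, 5, 6, 1, 14, 14, 4, 7]

# The column sums reproduce the BASys entries of Table 5.
print(sum(basys_noasis), sum(basys_doasis))  # 131 157
print(round(pearson(basys_noasis, basys_doasis), 3))
```

A coefficient close to 1 would indicate that replacing the functionality annotation (DOASIS) changes little with respect to the original annotation (NOASIS), whereas a low value points to genomes where the two modules disagree strongly, such as PA7 in Table 4.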

Prodigal has been studied separately, as it does not provide gene functionality. The number of Real ISs per genome returned by our pipeline using Prodigal is given in Figure 7. As shown in Table 6, the quality of the coding sequences predicted with Prodigal, compared with the other annotation tools, allows us to discover the largest number of real ISs. In particular, we have greatly improved the results produced by OASIS and ISFinder on NCBI annotations, which are what is usually used in the literature that focuses on bacterial insertion sequences. Furthermore, this table illustrates a certain sensitivity of coding sequence prediction tools with functionality annotation capabilities when detecting ISs in some specific genomes like PA7. Indeed we discovered, during other studies we carried out on this set of Pseudomonas strains, that PA7 has a lot of specific genes, that is, genes which are not in the core genome of all Pseudomonas, which may explain such a sensitivity.

Figure 7: Real ISs found by our pipeline using Prodigal

Table 6: Final comparison using our pipeline

            NCBI    BASys   Prokka  Prodigal               NCBI    BASys   Prokka  Prodigal
Name        DOASIS  DOASIS  DOASIS  DOASIS      Name       DOASIS  DOASIS  DOASIS  DOASIS
PACS2       0       2       2       3           c7447m     10      10      10      12
PAO1        10      6       10      12          PAO581     10      8       10      8
UCBPP-PA14  4       8       4       8           PAO1-VE2   10      9       10      9
PA7         15      0       14      18          PAO1-VE13  10      8       10      9
LESB58      3       6       3       6           PA1R       4       5       4       5
M18         3       2       3       3           PA1        5       6       5       6
NCGM2.S1    11      12      19      14          MTB-1      1       1       1       2
DK2         12      11      13      17          LES431     6       14      6       6
B136-33     5       5       5       7           SCV20265   15      14      15      15
19BR        8       10      5       11          PA38182    7       4       7       7
213BR       8       4       8       10          YL84       7       7       7       8
RP73        5       5       5       5
Total IS    84      71      91      114
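As a consistency check, the per-genome counts of Table 6 can be summed per tool. The short Python sketch below does so; the variable names are illustrative. The Total IS row of the table (84, 71, 91, 114) corresponds to the sum over the first twelve genomes (PACS2 to RP73), while the sums over all 23 genomes agree with the DOASIS counts of Table 5 for NCBI, BASys, and Prokka.

```python
TOOLS = ("NCBI", "BASys", "Prokka", "Prodigal")

# Real IS counts per genome (DOASIS), copied from Table 6 in column order TOOLS.
TABLE6 = {
    "PACS2": (0, 2, 2, 3), "PAO1": (10, 6, 10, 12), "UCBPP-PA14": (4, 8, 4, 8),
    "PA7": (15, 0, 14, 18), "LESB58": (3, 6, 3, 6), "M18": (3, 2, 3, 3),
    "NCGM2.S1": (11, 12, 19, 14), "DK2": (12, 11, 13, 17), "B136-33": (5, 5, 5, 7),
    "19BR": (8, 10, 5, 11), "213BR": (8, 4, 8, 10), "RP73": (5, 5, 5, 5),
    "c7447m": (10, 10, 10, 12), "PAO581": (10, 8, 10, 8), "PAO1-VE2": (10, 9, 10, 9),
    "PAO1-VE13": (10, 8, 10, 9), "PA1R": (4, 5, 4, 5), "PA1": (5, 6, 5, 6),
    "MTB-1": (1, 1, 1, 2), "LES431": (6, 14, 6, 6), "SCV20265": (15, 14, 15, 15),
    "PA38182": (7, 4, 7, 7), "YL84": (7, 7, 7, 8),
}

rows = list(TABLE6.values())
first_twelve = {tool: sum(row[i] for row in rows[:12]) for i, tool in enumerate(TOOLS)}
all_genomes = {tool: sum(row[i] for row in rows) for i, tool in enumerate(TOOLS)}

print(first_twelve)  # {'NCBI': 84, 'BASys': 71, 'Prokka': 91, 'Prodigal': 114}
print(all_genomes)   # {'NCBI': 169, 'BASys': 157, 'Prokka': 176, 'Prodigal': 201}
```

Over the whole set of genomes, Prodigal thus yields the largest number of Real ISs among the four annotation sources.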

Insertion sequences of bacterial genomes are usually studied using OASIS and ISFinder on NCBI annotations. We have shown in this article that a pipeline can be designed to improve the accuracy of IS detection and classification by improving the coding sequence prediction stage, and by considering a priori each sequence as a transposase. The source code of this pipeline can be downloaded from the following link (http://members.femto-st.fr/christophe-guyeux/en/insertion-sequences). A comparison has been conducted on a set of Pseudomonas aeruginosa genomes, showing an obvious improvement in the detection of insertion sequences for some particular configurations of our pipeline.

In future work, we intend to enlarge the number of coding sequence and functionality prediction tools and to merge all the Real IS results in order to further improve the accuracy of our pipeline. We will then focus on the impact of IS elements on P. aeruginosa evolution, comparing the phylogenetic tree of strains of this species with a phylogeny of their insertion sequences. Insertion events will then be investigated, and related to the genome rearrangements found in this collection of strains. We will finally extend our pipeline to eukaryotic genomes and to other kinds of transposable elements.

References

[1] Ramy K Aziz, Mya Breitbart, and Robert A Edwards. Transposases are the most abundant, most ubiquitous genes in nature. Nucleic Acids Research, 38(13):4207–4217, 2010.
[2] Arthur L Delcher, Douglas Harmon, Simon Kasif, Owen White, and Steven L Salzberg. Improved microbial gene identification with Glimmer. Nucleic Acids Research, 27(23):4636–4641, 1999.
[3] Cédric Feschotte and Ellen J Pritham. Computational analysis and paleogenomics of interspersed repeats in eukaryotes. Computational Genomics: Current Methods, pages 31–54, 2007.
[4] Morgan C Florek, Daniel P Gilbert, and Gordon R Plague. Insertion sequence distribution bias in Archaea. Mobile Genetic Elements, 4(1):e27829, 2014.
[5] Jennifer S Hawkins, HyeRan Kim, John D Nason, Rod A Wing, and Jonathan F Wendel. Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium. Genome Research, 16(10):1252–1261, 2006.
[6] Alison Burgess Hickman, Michael Chandler, and Fred Dyda. Integrating prokaryotes and eukaryotes: DNA transposases in light of structure. Critical Reviews in Biochemistry and Molecular Biology, 45(1):50–69, 2010.
[7] Doug Hyatt, Gwo-Liang Chen, Philip F LoCascio, Miriam L Land, Frank W Larimer, and Loren J Hauser. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11(1):119, 2010.
[8] Hirokazu Ochiai, Yasuhiro Inoue, Masaru Takeya, Aeni Sasaki, and Hisatoshi Kaku. Genome sequence of Xanthomonas oryzae pv. oryzae suggests contribution of large numbers of effector genes and insertion sequences to its race diversity. Japan Agricultural Research Quarterly, 39(4):275, 2005.
[9] Gordon R Plague. Intergenic transposable elements are not randomly distributed in bacteria. Genome Biology and Evolution, 2:584–590, 2010.
[10] David G Robinson, Ming-Chun Lee, and Christopher J Marx. OASIS: an automated program for global investigation of bacterial and archaeal insertion sequences. Nucleic Acids Research, 40(22):e174–e174, 2012.
[11] Torsten Seemann. Prokka: rapid prokaryotic genome annotation. Bioinformatics, page btu153, 2014.
[12] Patricia Siguier, Jocelyne Pérochon, L Lestrade, Jacques Mahillon, and Michael Chandler. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Research, 34(suppl 1):D32–D36, 2006.
[13] Marie Touchon and Eduardo PC Rocha. Causes of insertion sequences abundance in prokaryotic genomes. Molecular Biology and Evolution, 24(4):969–981, 2007.
[14] Varani et al. … Genome Biol, 12(3):R30, 2011.
[15] Andreas Wagner, Christopher Lewis, and Manuel Bichsel. A survey of bacterial insertion sequences using IScan. Nucleic Acids Research, 35(16):5284–5293, 2007.
[16] Tetsushi Yada and Makoto Hirosawa. Detection of short protein coding regions within the cyanobacterium genome: application of the hidden Markov model. DNA Research, 3(6):355–361, 1996.
[17] Fengfeng Zhou, Victor Olman, and Ying Xu. Insertion sequences show diverse recent activities in cyanobacteria and archaea.