Relation between Gene Content and Taxonomy in Chloroplasts

Abstract

The aim of this study is to investigate the relation between the phylogeny of a large set of complete chloroplast genomes and the evolution of gene content inside these sequences. Core and pan genomes have been computed on a de novo annotation of these 845 genomes, the former being used to produce a well-supported phylogenetic tree, while the latter provides information regarding the evolution of gene content over time. The study also details the specificity of some branches of the tree, when this specificity is obtained on accessory genes. After having detailed the material and methods, we emphasize some remarkable relations between well-known events of chloroplast history, like endosymbiosis, and the evolution of gene content over the phylogenetic tree.


Bashar Al-Nuaimi ∗†, Christophe Guyeux ∗, Bassam AlKindy ‡, Jean-François Couchot ∗, and Michel Salomon ∗
∗ FEMTO-ST Institute, UMR 6174 CNRS, DISC Computer Science Department, Université de Bourgogne Franche-Comté, France
† Department of Computer Science, University of Diyala, Iraq
‡ Department of Computer Science, University of Mustansiriyah, Iraq


Index Terms: Chloroplasts, Phylogeny, Taxonomy, Core and Pan genomes, Gene content

I. INTRODUCTION

Understanding the evolution of DNA molecules is a very complex problem, and no concrete and well-established solution currently exists for large DNA sequences. Our objective in this article is to begin to show that this complex problem can be (at least partially) solved when considering genomes of reasonable size that have undergone a reasonable number of recombinations, as in the case of chloroplasts. However, various difficulties remain to be overcome when dealing with such a specific case, and solving them requires the design of new ad hoc tools. Candidates for such tools are presented in this article and applied to the chloroplast case.

Chloroplasts are one of the numerous types of organelles in the plant cell. The term chloroplast comes from the combination of chloro and plastid, meaning that it is an organelle found in plant cells that contains chlorophyll. Chloroplasts are able to convert water, light energy, and carbon dioxide (CO2) into chemical energy by using the carbon-fixation cycle [1] (also called the Calvin cycle, the whole process being called photosynthesis). This pivotal role explains why chloroplasts are at the basis of most trophic chains and are thus responsible for evolution and speciation.

TABLE I: Information on chloroplast sizes at the highest taxonomic level

Taxonomy        Nb. of genomes   Min length   Max length   Average length   Standard deviation
Alveolata                    4        85535       140426         115714.2              19648.3
Cryptophyta                  2       121524       135854         128689.0               7165.0
Euglenozoa                   7        80147       143171          98548.7              19784.5
Haptophyceae                 3        95281       107461         102683.6               5307.6
Rhodophyta                   9       149987       217694         183755.5              18092.2
Stramenopiles               35        89599       165809         124895.1              15138.0
Viridiplantae              775        80211       289394         150194.9              20376.8

Consequently, investigating the evolutionary history of chloroplasts is of great interest, and our long-term objective is to explore it by means of ancestral genome reconstruction. This reconstruction will be carried out in order to discover how the molecules have evolved over time and at which rate, and to determine whether this evolution provides evidence of their cyanobacterial origin. This long-term objective requires numerous intermediate research advances. Among other things, it presupposes being able to apply the ancestral reconstruction to a well-supported phylogenetic tree of a representative collection of chloroplastic genomes. Indeed, the sister relationship of two species must be clearly established before trying to reconstruct their ancestor. Additionally, it implies being able to detect content evolution (modifications of genomes such as gene loss and gain) along this accurate tree. In other words, gene content evolution on the one hand, and accurate phylogenetic inference on the other hand, must be carefully considered in the particular case of chloroplast sequences, as they are the two most important prerequisites in our quest for the last universal common ancestor of these chloroplasts.

The objective of this research work is to make significant progress in this quest by providing the material and methods required for the study of chloroplastic sequence evolution. The contributions of this article consist in the computation of the core and pan genomes of the 845 complete genomes available on the NCBI, in the production of a well-supported phylogenetic tree based on core sequences as large as possible, and in the study of the produced data. In particular, we begin to highlight some links between the phylogenetic tree and the evolution of gene content.

TABLE II: Example of genome information for the Streptophyta clade

Organism name              Accession number   Sequence length   Nb. of CDS
Epimedium sagittatum       NC_029428.1                 158273           85
Berberis bealei            NC_022457.1                 164792          267
Torreya fargesii           NC_029398.1                 137075          100
Lepidozamia peroffskyana   NC_027513.1                 165939           93
Actinidia chinensis        NC_026690.1                 156346          271
Quercus aliena             NC_026790.1                 160921          259
Quercus aquifolioides      NC_026913.1                 160415          176
Sedum sarmentosum          NC_023085.1                 150448           99

The paper is structured as follows. In the next section, the material and methods applied in this study are presented, which encompass genome acquisition and annotation, core and pan genome analysis, and phylogenetic investigations. The results of these analyses are detailed in Section III for the chloroplast case. The article ends with a conclusion section, in which the study is summarized and intended future work is outlined.

II. MATERIALS AND METHODS

A. Data acquisition

A set of 845 chloroplastic genomes (green algae, red algae, gymnosperms, and so on) has been downloaded from the NCBI website, representing all the complete genomes available as of March 2016 (see Table I). An example of such sequences, taken from the Streptophyta clade (a Viridiplantae), is provided in Table II. Note that this set is not a well-balanced representation of the diversity of plants, as plants of particular and immediate interest to us, like Viridiplantae, are sequenced first. We must however deal with such bias, as genomic data acquisition is most of the time human-centred. This set of sequences also presents a certain variability in terms of length, as detailed in Table I.

Each genome has been annotated with DOGMA [2], an online automatic and accurate annotation tool for organellar genomes, following the same approach as in [3]. To apply it at our large scale, we have written (with the agreement of the DOGMA authors) a script that automatically sends requests to the website. By doing so, the same gene prediction and naming process has been applied with the same average quality of annotation. In particular, when a gene appears twice in the considered set of genomes, it receives the same name both times (no spelling error). At this level, each genome is then described by an ordered list of gene names, with possible duplications (other approaches for the annotation stage are possible, see, e.g., [3]). This description will allow us to investigate, later in this article, the evolution of gene content along the species tree, leading to the study of core and pan genomes recalled below.

TABLE III: Summarized properties of the pan genomes at the highest taxonomic level.

Taxonomy        Nb. of genomes   Min nb. of pan genes   Max nb. of pan genes   Average nb. of pan genes
Alveolata
Cryptophyta
Euglenozoa
Haptophyceae
Rhodophyta
Stramenopiles               35                     73                    271                    238.971
Viridiplantae              775                     85                    271                    229.827

B. Core and pan genome

Given a collection of genomes, it is possible to define their core genes as the common genes that are shared among all the species, while the pan genome is the union of all the genes that are in at least one genome (all the species have each core gene, while a pan gene is in at least one genome). Shared genes are evidence of evolution from a common ancestor and of the relatedness of chloroplast organisms.

Distinguishing and determining the core genes may be important either to identify the specificity and the shared functionality of a given set of species, or to evaluate their phylogeny using the largest set of shared coding sequences. In the case of chloroplasts, an important category of genome modification is indeed the loss of functional genes, either because they become ineffective or due to transfer to the nucleus. Thereby, a small number of gene losses among species may indicate that these species are close to each other and belong to a similar lineage, while a significant loss means distant lineages.

So the core genome is obviously of importance when inferring the phylogenetic relationship, while the accessory genes of the pan genome explain to some extent the specificity of each species. We have formerly proposed three approaches for eliciting core genomes. The first one uses correlations computed on predicted coding sequences [3], while the second one uses all the information provided during an accurate annotation stage [4]. The third method takes advantage of the first two approaches, by considering gene information and DNA sequences, in order to find the targeted core genome [5].

Fig. 1: Phylogenetic tree overview

Fig. 2: The distribution of chloroplast genomes depending on genome size.

We have found the core genome of each selected family by using the second method described in the previous paragraph [4]. Obtained results regarding gene content are discussed in the next section. The core genome has also been used for our phylogenetic investigation of chloroplast sequences, which has been carried out as described hereafter.

All data are available at...
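As an illustration of the definitions used in this subsection, the following minimal sketch computes a core genome (intersection) and a pan genome (union) from annotated gene lists. It is not the authors' pipeline: the function name `core_and_pan` and the toy genomes are hypothetical, and genes are compared by name only, which assumes a consistent naming process such as the one described for the annotation stage.

```python
def core_and_pan(genomes):
    """Return (core, pan) gene sets.

    `genomes` maps a genome name to its ordered list of gene names
    (duplicates allowed, as in the annotation described in the text).
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)  # genes present in every genome
    pan = set.union(*gene_sets)          # genes present in at least one genome
    return core, pan


# Hypothetical toy annotations, for illustration only (not taken from the dataset).
toy = {
    "genome_A": ["RBCL", "PSBA", "ATPA", "PSBA"],
    "genome_B": ["RBCL", "PSBA", "ACCA"],
    "genome_C": ["RBCL", "PSBA", "ATPA", "RPL2"],
}
core, pan = core_and_pan(toy)
accessory = pan - core  # pan genes that are not shared by all genomes
```

On this toy input the accessory genes are those missing from at least one genome, which is the kind of information used later to characterize branches of the tree.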

C. Phylogeny study

The next step when trying to reconstruct the evolution of gene content over time is to investigate in depth the phylogeny of these chloroplasts, in order to obtain a tree as well supported as possible. Indeed, a branching error in the tree may lead to an erroneous transmission of an ancestral state, which is then perpetuated all the way up to the last universal common ancestor. However, as we considered all existing plant taxa, we faced chloroplastic sequences that have diverged greatly over two billion years, so the core genome of these 845 sequences is very small compared with the sequence length of each representative, and inferring a tree from such partial information will probably lead to numerous errors.

The approach adopted in our study was then to group the plants into packets of close species (same family in the taxonomy). Such grouping has enlarged the number of shared gene sequences (core genes of the considered family) on which a more representative phylogeny can be computed [6]. After having aligned the core genes of each family using MUSCLE [7] on our supercomputer facilities, we then inferred one phylogenetic tree per family.

To obtain such a tree, the RAxML [8], [9] program has been employed to compute the phylogenetic maximum-likelihood (ML) function with the following setup: General Time Reversible model of nucleotide substitution, with the Γ model of rate heterogeneity and hill-climbing optimization method. The Prochlorococcus marinus (NC_009091.1) cyanobacteria species has finally been chosen as outgroup, due to the supposed cyanobacterial origin of chloroplasts.

After such a computation, if all bootstrap values are larger than 95%, then we consider that the phylogeny is resolved, as the largest possible number of genes has led to a very well supported tree.
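The 95% acceptance rule can be sketched as a small check on a Newick string. This is only an illustration under stated assumptions: bootstrap supports are assumed to be written as internal-node labels right after a closing parenthesis, the names `bootstrap_values` and `is_resolved` are hypothetical, and the regular expression is a simplification that does not handle quoted labels.

```python
import re

# An internal-node label placed right after a closing parenthesis,
# e.g. the "98" in "(A:0.1,B:0.2)98:0.05" (assumed layout of the supports).
SUPPORT_PATTERN = re.compile(r"\)(\d+(?:\.\d+)?)")


def bootstrap_values(newick):
    """Extract all internal-node support values from a Newick string."""
    return [float(value) for value in SUPPORT_PATTERN.findall(newick)]


def is_resolved(newick, threshold=95.0):
    """True when the tree carries supports and all of them exceed the threshold."""
    values = bootstrap_values(newick)
    return bool(values) and all(value > threshold for value in values)


# Hypothetical toy trees, for illustration only.
resolved_tree = "((A:0.1,B:0.2)98:0.05,(C:0.1,D:0.1)100:0.02);"
unresolved_tree = "((A:0.1,B:0.2)80:0.05,(C:0.1,D:0.1)100:0.02);"
```

A tree without any support label is reported as not resolved, since nothing can be concluded about its branches.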
In the case where some branches are not supported, we can wonder whether a few genes are responsible for this lack of support, for a large variety of reasons encompassing homoplasy, stochastic errors, undetected paralogy, incomplete lineage sorting, horizontal gene transfers, or even hybridization. This problem has been resolved by finding the largest subset of core genes leading to the most supported tree, using the heuristic approach coupled with statistical LASSO tests described in [6], [10]. The obtained trees are then merged into a well-supported and representative supertree.

III. OBTAINED RESULTS

A. Phylogenetic investigations

The approach detailed in the previous section has led to a well-supported phylogenetic tree of all the available chloroplasts, with the ordered list of genes at each leaf of the tree. An overview of this tree is provided in Figure 1. The obtained tree, also available on our website, is in general coherent with the NCBI taxonomy, except in some specific locations.

Going into the details of the obtained tree, it is well known that the first plant endosymbiosis resulted in a great diversification of lineages comprising Red Algae, Green Algae, and Land Plants (terrestrial). The interesting point in our results is that the organisms resulting from the first endosymbiosis are distributed in each of the lineages found in the chloroplast genome structure evolution, as outlined in Figure 1. More precisely, all Red Algae chloroplasts are grouped together in one lineage, while Green Algae and Land Plant chloroplasts are all in a second lineage. Furthermore, organisms resulting from the secondary endosymbioses, as listed in Table IV, are well localized in the tree: the chloroplasts of both Brown Algae and Dinoflagellates representatives are found exclusively in the lineage also comprising the Red Algae chloroplasts from which they evolved, while the Euglenids are related to the Green Algae from which they evolved. The latter makes sense regarding biology, the history of lineages, and theories of chloroplast origins (and hence of photosynthetic ability) in different eukaryotic lineages [11].

B. Gene content

Let us now investigate the gene content level of the tree. Indeed, genes are rearranged in the genome by evolutionary events like insertion, deletion, transposition, and inversion, which are called genome rearrangements [12]. Such rearrangements can be studied, given that we have both the gene contents and the phylogeny. A general overview of the obtained results, in terms of gene content (pan genome) evolution at the top taxonomic level, is provided in Table III, and it is detailed for the next taxonomic level in Table IV.

The core genome consists of 36 coding sequences, namely: ATPA, ATPB, ATPH, ATPI, PETB, PETG, PSAA, PSAB, PSAC, PSAJ, PSBA, PSBC, PSBD, PSBE, PSBF, PSBH, PSBI, PSBJ, PSBL, PSBN, PSBT, PSI_PSBT, RBCL, RPL14, RPL16, RPL2, RPL20, RPL36, RPS11, RPS12, RPS12_3END, RPS14, RPS19, RPS2, RPS7, and RRN16. The pan genome of all the considered species, for its part, contains 268 genes. Note that, according to our computation, no gene was specific to a given clade (that is, present in only one clade).
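The clade-specificity check mentioned at the end of this subsection (a gene present in only one clade) can be sketched as follows. This is a hedged illustration rather than the computation actually used: `clade_specific_genes` is a hypothetical helper, and the clade names and gene sets below are toy data, not values from Tables III or IV.

```python
def clade_specific_genes(clade_pan_genomes):
    """Map each clade to the genes of its pan genome found in no other clade."""
    specific = {}
    for clade, genes in clade_pan_genomes.items():
        # Union of the pan genomes of all the other clades.
        others = set()
        for other_clade, other_genes in clade_pan_genomes.items():
            if other_clade != clade:
                others |= set(other_genes)
        specific[clade] = set(genes) - others
    return specific


# Hypothetical toy pan genomes, for illustration only.
toy_clades = {
    "clade_1": {"RBCL", "PSBA", "ACCA"},
    "clade_2": {"RBCL", "PSBA", "RPL2"},
    "clade_3": {"RBCL", "PSBA", "ACCA", "RPL2"},
}
no_specific = clade_specific_genes(toy_clades)

# Adding a clade with a gene seen nowhere else yields one clade-specific gene.
with_specific = clade_specific_genes({**toy_clades, "clade_4": {"RBCL", "YCF1"}})
```

In the first toy case every clade maps to an empty set, which mirrors the situation reported above; in the second, only the added clade has a non-empty result.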

C. Relations between gene content and phylogeny

We then further investigated the distribution of the number of genes according to the group of species. Obtained results are reproduced in Figures 2 and 3. Four groups have appeared among the 845 genomes, which are taxonomically coherent. As shown in Fig. 3, the cluster of largest genomes has a number of genes ranging from 229 to 271, while in the group of smallest genomes, the lowest number of genes is for the Viridiplantae case. In particular, among the genomes having fewer than 120 genes, we found accession number NC_012903.1 (Eukaryota, Stramenopiles, Pelagophyceae, Pelagomonadales, Aureoumbra lagunensis), and 63 Spermatophyta species: 3 Pinidae, 58 Magnoliophyta, one Cycadidae, and finally one Gnetidae. Overall, the chloroplast genomes contain from 73 to 271 genes.

We can further note that (1) most of the organisms in the green lineage (green algae and land plants) have a lower number of genes in their chloroplasts compared to the red algae; (2) most land plants have genome sizes ranging between 120 and 160 kb [13]; and (3) most of the differences in genome size are due to the number of paralogous genes.

When examining more closely the ordered lists of genes to investigate the reasons for such differences in size, it appears that the gene content evolution can mostly be explained by repetitions of some genes and the loss of others: no large-scale recombination is responsible for such variations. The usual case is as in Figure 4 for the ACCA pan gene, in which single vulnerable genes are lost, possibly in various independent branches, due to deleterious mutations. Such results have been obtained by comparing, for each pair of close genomes, all gene names and positions, through manual inspection assisted by homemade scripts. Some mutation and indel events are also provided in Table V, for the sake of illustration.

TABLE IV: Taxonomy in the second level

Taxonomy                            Nb. of genomes   Min nb. of pan genes   Max nb. of pan genes   Avg nb. of pan genes
Alveolata       Chromerida
                Dinophyceae
Cryptophyta     Pyrenomonadales
Euglenozoa      Euglenida
Haptophyceae    Phaeocystales
                Isochrysidales
                Pavlovales
Rhodophyta      Bangiophyceae
                Florideophyceae
Stramenopiles   Px_clade
                Bacillariophyta                 20                    138                    271                 231.35
                Eustigmatophyceae
                Raphidophyceae
                Pelagophyceae
Viridiplantae   Chlorophyta                     58                    156                    271                244.517
                Streptophyta                   717                     85                    271                228.638

Fig. 3: Classification of chloroplast genomes according to numbers of pan genes.

IV. CONCLUSION

In this article, we made significant progress in the study of chloroplastic sequence evolution, by providing the material and methods required in the quest for the ancestral genome of the chloroplasts. A large set of complete chloroplast genomes has been studied de novo regarding core and pan genomes, phylogenetic relationships, and gene content modifications. We then started to study the produced data, by emphasizing some remarkable relations between well-known events of chloroplast history and the evolution of gene content over the phylogenetic tree.

In future work, our intention is to investigate more systematically such relations between remarkable ancestral nodes in the tree, endosymbiosis events, and the evolution of gene content. We will examine whether some branches of the trees are statistically remarkable when considering gene content (for instance, whether there is a correlation between the presence or absence of a subset of genes and a particular taxonomy). Then, the gene ordering and content of each ancestral node will be computed using ad hoc algorithms, ancestral DNA sequences will be inferred, and ancestral intergenic regions will be deduced, in order to obtain all ancestral genomes with confidence indications such as probabilities. The produced ancestral genomes will then be used to investigate hypotheses formulated by biologists regarding the origin of chloroplasts, their recombination events, and the transfer of some material to the nucleus. In particular, we will study whether recombination events were uniform over time and over the whole sequence, or whether it is possible to highlight some hot spots of recombination in the history of these chloroplasts.

All computations have been performed using the “Mésocentre de calcul de l’Université de Franche-Comté” supercomputer facilities.

TABLE V: Example of comparison between pairwise genomes from various species, to investigate the changes that occurred within branches of the tree.

Genome names                 Clade           Sub-kingdom      Order/Family       Nb. of pan genes   Deletion/Insertion   Matching ratio
Camelina / Barbarea          Viridiplantae   Embryophyta      Camelineae                      267   173/0                         48.46
Aquilaria / Hibiscus         Viridiplantae   Embryophyta      Camelineae                      267   164/0                         28.26
Acer / Azadirachta           Viridiplantae   Embryophyta      Sapindales                      267   173/0                         49.02
Lepidozamia / Zamia          Viridiplantae   Embryophyta      Zamiaceae                       267   172/0                         51.66
Lavanduleae / Perman         Viridiplantae   Embryophyta      Lamiaceae                       267   173/0                         49.91
Aureoumbra / Aureococcus     Stramenopiles   Pelagophyceae    Pelagomonadales                 267   193/0                         27.13
Epimedium / Berberis         Viridiplantae   Eudicotyledons   Berberidoideae                  267   180/0                         33.52
NC_026690_1 / NC_026691_1    Viridiplantae   Eudicotyledons   Actinidia                       271   174/0                         48.63

Fig. 4: ACCA gene loss in various branches of the tree

REFERENCES

[1] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, John H. Wilson, and Tim Hunt. Molecular biology of the cell. Biochemistry and Molecular Biology Education, 31(3):212–213, 2003.
[2] Stacia K. Wyman, Robert K. Jansen, and Jeffrey L. Boore. Automatic annotation of organellar genomes with DOGMA. Bioinformatics, Oxford Press, 20(17):3252–3255, 2004.
[3] Bassam AlKindy, Jean-François Couchot, Christophe Guyeux, Arnaud Mouly, Michel Salomon, and Jacques M. Bahi. Finding the core-genes of chloroplasts. Journal of Bioscience, Biochemistry, and Bioinformatics, 4(5):357–364, 2014.
[4] Bassam AlKindy, Christophe Guyeux, Jean-François Couchot, Michel Salomon, and Jacques M. Bahi. Gene similarity-based approaches for determining core-genes of chloroplasts. In Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, pages 71–74. IEEE, 2014.
[5] Bassam AlKindy, Huda Al-Nayyef, Christophe Guyeux, Jean-François Couchot, Michel Salomon, and Jacques M. Bahi. Improved core genes prediction for constructing well-supported phylogenetic trees in large sets of plant species. In Francisco Ortuño and Ignacio Rojas, editors, Bioinformatics and Biomedical Engineering, volume 9043 of Lecture Notes in Computer Science, pages 379–390. Springer International Publishing, 2015.
[6] Bassam AlKindy, Christophe Guyeux, Jean-François Couchot, Michel Salomon, Christian Parisod, and Jacques M. Bahi. Hybrid genetic algorithm and lasso test approach for inferring well supported phylogenetic trees based on subsets of chloroplastic core genes. CoRR, abs/1504.05095, 2015.
[7] Robert C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.
[8] Alexandros Stamatakis. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, page btu033, 2014.
[9] Jeffrey Rizzo and Eric C. Rouchka. Review of phylogenetic tree construction. University of Louisville Bioinformatics Laboratory Technical Report Series, pages 2–7, 2007.
[10] Reem Alsrraj, Bassam AlKindy, Christophe Guyeux, Laurent Philippe, and Jean-François Couchot. Well-supported phylogenies using largest subsets of core-genes by discrete particle swarm optimization. Proceedings of CIBB, 2:1, 2015.
[11] Xi Li, Ti-Cao Zhang, Qin Qiao, Zhumei Ren, Jiayuan Zhao, Takahiro Yonezawa, Masami Hasegawa, M. James C. Crabbe, Jianqiang Li, and Yang Zhong. Complete chloroplast genome sequence of holoparasite Cistanche deserticola (Orobanchaceae) reveals gene loss and horizontal gene transfer from its host Haloxylon ammodendron (Chenopodiaceae). PloS one, 8(3):e58747, 2013.
[12] Mikita Suyama and Peer Bork. Evolution of prokaryotic gene order: genome rearrangements in closely related species. Trends in Genetics, 17(1):10–13, 2001.
[13] Beverley R. Green. Chloroplast genomes of photosynthetic eukaryotes.