Relation between Gene Content and Taxonomy in Chloroplasts
Bashar Al-Nuaimi ∗†, Christophe Guyeux ∗, Bassam AlKindy ‡, Jean-François Couchot ∗, and Michel Salomon ∗
∗ FEMTO-ST Institute, UMR 6174 CNRS, DISC Computer Science Department, Université de Bourgogne Franche-Comté, France
† Department of Computer Science, University of Diyala, Iraq
‡ Department of Computer Science, University of Mustansiriyah, Iraq
Abstract
The aim of this study is to investigate the relation that can be found between the phylogeny of a large set of complete chloroplast genomes and the evolution of gene content inside these sequences. Core and pan genomes have been computed on a de novo annotation of these 845 genomes, the former being used to produce a well-supported phylogenetic tree, while the latter provides information regarding the evolution of gene content over time. The study also details the specificity of some branches of the tree, when specificity is obtained on accessory genes. After having detailed the material and methods, we emphasize some remarkable relations between well-known events of the chloroplast history, like endosymbiosis, and the evolution of gene content over the phylogenetic tree.
Index Terms: Chloroplasts, Phylogeny, Taxonomy, Core and Pan genomes, Gene content
I. INTRODUCTION
Understanding the evolution of DNA molecules is a very complex problem, and no concrete and well-established solution exists at present for large DNA sequences. Our objective in this article is to start to show that this complex problem can be (at least partially) solved when considering genomes of reasonable size that have undergone a reasonable number of recombinations, as in the chloroplast case. However, various difficulties remain to be circumvented when dealing with such a specific case, and solving them requires the design of new ad hoc tools. Candidates for such tools are presented in this article and applied to the chloroplast case.

Chloroplasts are one of the numerous types of organelles in the plant cell. The term chloroplast comes from the combination of chloro and plastid, meaning that it is an organelle found in plant cells that contains chlorophyll. Chloroplasts have the ability to convert water, light energy, and carbon dioxide (CO2) into chemical energy by using the carbon-fixation cycle [1] (also called the Calvin cycle, the whole process being called photosynthesis). This pivotal role explains why chloroplasts are at the basis of most trophic chains and are thus responsible for evolution and speciation.

TABLE I: Information on chloroplast sizes at highest taxonomic level

Taxonomy        Nb. of genomes   Min length   Max length   Average length   Standard deviation
Alveolata       4                85535        140426       115714.2         19648.3
Cryptophyta     2                121524       135854       128689.0         7165.0
Euglenozoa      7                80147        143171       98548.7          19784.5
Haptophyceae    3                95281        107461       102683.6         5307.6
Rhodophyta      9                149987       217694       183755.5         18092.2
Stramenopiles   35               89599        165809       124895.1         15138.0
Viridiplantae   775              80211        289394       150194.9         20376.8

Consequently, investigating the evolutionary history of chloroplasts is of great interest, and our long-term objective is to explore it by means of ancestral genome reconstruction. This reconstruction will be carried out in order to discover how the molecules have evolved over time and at which rate, and to determine whether this evolution provides evidence of their cyanobacterial origin. This long-term objective necessitates numerous intermediate research advances. Among other things, it requires being able to apply the ancestral reconstruction on a well-supported phylogenetic tree of a representative collection of chloroplastic genomes. Indeed, the sister relationship of two species must be clearly established before trying to reconstruct their ancestor. Additionally, it requires being able to detect content evolution (modifications of genomes like gene loss and gain) along this accurate tree. In other words, gene content evolution on the one hand, and accurate phylogenetic inference on the other hand, must be carefully considered in the particular case of chloroplast sequences, as the two most important prerequisites in our quest for the last universal common ancestor of these chloroplasts.

The objective of this research work is to make significant progress in this quest, by providing material and methods required in the study of chloroplastic sequence evolution. The contributions of this article consist in the computation of core and pan genomes of the 845 complete genomes available on the NCBI, in the production of a well-supported phylogenetic tree based on core sequences as large as possible, and in the study of the produced data. In particular, we start to emphasize some links between the phylogenetic tree and the evolution of gene content.

TABLE II: Example of genome information for the Streptophyta clade

Organism name              Accession number   Sequence length   Nb. of CDS
Epimedium sagittatum       NC_029428.1        158273            85
Berberis bealei            NC_022457.1        164792            267
Torreya fargesii           NC_029398.1        137075            100
Lepidozamia peroffskyana   NC_027513.1        165939            93
Actinidia chinensis        NC_026690.1        156346            271
Quercus aliena             NC_026790.1        160921            259
Quercus aquifolioides      NC_026913.1        160415            176
Sedum sarmentosum          NC_023085.1        150448            99

The paper is structured as follows. In the next section, the material and methods applied in this study are presented, which encompass genome acquisition and annotation, core and pan genome analysis, and phylogenetic investigations. The results of these analyses are detailed in Section III, on the chloroplast case. This article ends with a conclusion section, in which the study is summarized and intended future work is outlined.

II. MATERIALS AND METHODS
A. Data acquisition
A set of 845 chloroplastic genomes (green algae, red algae, gymnosperms, and so on) has been downloaded from the NCBI website, representing all the complete genomes available as of March 2016 (see Table I). An example of such sequences, taken from the Streptophyta clade (a Viridiplantae), is provided in Table II. Note that this set is not a well-balanced representation of the diversity of plants, as plants of particular and immediate interest to us, like Viridiplantae, are sequenced first. We must however deal with such bias, as genomic data acquisition is most of the time human-centred. This set of sequences also exhibits a certain variability in terms of length, as detailed in Table I.

TABLE III: Summarized properties of the pan genomes at the highest taxonomic level.

Taxonomy        Nb. genomes   Min nb. of pan genes   Max nb. of pan genes   Average nb. of pan genes
Alveolata
Cryptophyta
Euglenozoa
Haptophyceae
Rhodophyta
Stramenopiles   35            73                     271                    238.971
Viridiplantae   775           85                     271                    229.827

Each genome has been annotated with DOGMA [2], an online automatic and accurate annotation tool for organellar genomes, following the same approach as in [3]. To apply it at our large scale, we have written (with the agreement of the DOGMA authors) a script that automatically sends requests to the website. By doing so, the same gene prediction and naming process has been applied with the same average quality of annotation. In particular, when a gene appears twice in the considered set of genomes, it receives the same name both times (no spelling error). At this level, each genome is then described by an ordered list of gene names, with possible duplications (other approaches for the annotation stage are possible, see, e.g., [3]). This description will allow us to investigate, later in this article, the evolution of gene content along the species tree, leading to the study of core and pan genomes recalled below.
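The per-taxon length statistics of Table I can be recomputed from the sequence lengths alone. The following is a minimal sketch, not the authors' script: the function name `length_summary` is hypothetical, and it assumes that the standard deviation of Table I is the population one. This assumption is consistent with the Cryptophyta row, where only two genomes are available, so that their lengths are necessarily the reported minimum and maximum.

```python
from statistics import mean, pstdev


def length_summary(lengths):
    """Return (count, min, max, mean, population std) for genome lengths in bp.

    Assumption: Table I reports the population standard deviation (pstdev).
    """
    return (
        len(lengths),
        min(lengths),
        max(lengths),
        round(float(mean(lengths)), 1),
        round(float(pstdev(lengths)), 1),
    )


# Cryptophyta has two genomes, whose lengths are the min and max of Table I.
cryptophyta_lengths = [121524, 135854]
summary = length_summary(cryptophyta_lengths)
print(summary)
```

With these two lengths, the sketch yields an average of 128689.0 and a deviation of 7165.0, which are the values of the Cryptophyta row.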
B. Core and pan genome
Given a collection of genomes, it is possible to define their core genes as the common genes that are shared among all the species, while the pan genome is the union of all the genes that are in at least one genome (that is, all the species have each core gene, while a pan gene is in at least one genome). Shared genes are evidence of evolution from a common ancestor and of the relatedness of chloroplast organisms.

Distinguishing and determining the core genes may be of importance either to identify the specificity and the shared functionality of a given set of species, or to evaluate their phylogeny using the largest set of shared coding sequences. In the case of chloroplasts, an important category of genome modification is indeed the loss of functional genes, either because they become ineffective or due to transfer to the nucleus. Thereby, a small number of gene losses among species may indicate that these species are close to each other and belong to a similar lineage, while a significant loss means distant lineages.

The core genome is thus of importance when inferring the phylogenetic relationship, while the accessory genes of the pan genome explain to some extent the specificity of each species. We have formerly proposed three approaches for eliciting core genomes. The first one uses correlations computed on predicted coding sequences [3], while the second one uses all the information provided during an accurate annotation stage [4]. The third method combines the advantages of the first two approaches, by considering both gene information and DNA sequences, in order to find the targeted core genome [5].

Fig. 1: Phylogenetic tree overview

Fig. 2: The distribution of chloroplast genomes depending on genome size.

All data are available at...

We have found the core genome of each selected family by using the second method described in the previous paragraph [4]. Obtained results regarding gene content are discussed in the next section. The core genome has also been used for our phylogenetic investigation of chloroplast sequences, which has been carried out as described hereafter.
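The definitions of core and pan genomes recalled in this subsection amount to a set intersection and a set union over the annotated gene names. The sketch below illustrates only these definitions, not the annotation-based method of [4]; the function name `core_and_pan` and the three toy genomes are hypothetical, and each genome is assumed to be given as an ordered list of gene names with possible duplications.

```python
def core_and_pan(genomes):
    """Compute core, pan and accessory gene sets.

    genomes: dict mapping a genome identifier to its ordered list of
    gene names (duplicated names are allowed and collapsed here).
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set(), set(), set()
    core = set.intersection(*gene_sets)  # genes present in every genome
    pan = set.union(*gene_sets)          # genes present in at least one genome
    accessory = pan - core               # pan genes that are not core genes
    return core, pan, accessory


# Hypothetical toy genomes, for illustration only.
toy_genomes = {
    "genome_1": ["PSBA", "RBCL", "ACCA", "RBCL"],
    "genome_2": ["PSBA", "RBCL", "ATPA"],
    "genome_3": ["RBCL", "PSBA"],
}
core, pan, accessory = core_and_pan(toy_genomes)
print(sorted(core), sorted(pan), sorted(accessory))
```

Working on sets deliberately ignores gene order and copy number, which are handled separately when gene content evolution is studied along the tree.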
C. Phylogeny study
The next step when trying to reconstruct the evolution of gene content over time is to deeply investigate the phylogeny of these chloroplasts, in order to obtain a tree that is as well supported as possible. Indeed, a branching error in the tree may lead to an erroneous transmission of an ancestral state, which is then perpetuated until reaching the last universal common ancestor. However, as we considered all existing plant taxa, we faced chloroplastic sequences that have diverged considerably over two billion years, so the core genome of these 845 sequences is very small when compared with the sequence length of each representative, and inferring a tree from such partial information would probably lead to numerous errors.

The approach adopted in our study was then to group the plants into close packets (same family in the taxonomy). Such grouping has enlarged the number of shared gene sequences (core genes of the considered family) on which a more representative phylogeny can be computed [6]. After having aligned the core genes of each family using MUSCLE [7] on our supercomputer facilities, we have inferred a phylogenetic tree per family.

To obtain such a tree, the RAxML [8], [9] program has been employed to compute the phylogenetic maximum-likelihood (ML) function with the following setup: General Time Reversible model of nucleotide substitution, with the Γ model of rate heterogeneity and the hill-climbing optimization method. The Prochlorococcus marinus (NC_009091.1) cyanobacteria species has finally been chosen as outgroup, due to the supposed cyanobacterial origin of chloroplasts.

After such a computation, if all bootstrap values are larger than 95%, then we consider that the phylogeny is resolved, as the largest possible number of genes has led to a very well supported tree.
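The resolution criterion stated above (all bootstrap values larger than 95%) can be checked automatically on an inferred tree. The following is a minimal sketch under stated assumptions: the tree is given as a Newick string in which bootstrap supports are written as internal node labels right after each closing parenthesis, and the helper names `support_values` and `is_resolved`, as well as the two toy trees, are hypothetical.

```python
import re


def support_values(newick):
    """Extract the support values written right after ')' in a Newick string."""
    return [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]


def is_resolved(newick, threshold=95.0):
    """Return True when every bootstrap value is strictly above the threshold."""
    values = support_values(newick)
    return bool(values) and all(v > threshold for v in values)


# Hypothetical toy trees: branch lengths follow ':', supports follow ')'.
well_supported = "((A:0.1,B:0.2)100:0.05,(C:0.1,D:0.1)97:0.05,Out:0.3);"
weakly_supported = "((A:0.1,B:0.2)100:0.05,(C:0.1,D:0.1)80:0.05,Out:0.3);"
print(is_resolved(well_supported), is_resolved(weakly_supported))
```

A tree without any support label is reported as not resolved, which is a conservative choice of this sketch rather than something stated in the study.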
In case some branches are not supported, we can wonder whether a few genes are responsible for this lack of support, for a large variety of reasons encompassing homoplasy, stochastic errors, undetected paralogy, incomplete lineage sorting, horizontal gene transfers, or even hybridization. This problem has been addressed by finding the largest subset of core genes leading to the most supported tree, using the heuristic approach coupled with statistical LASSO tests described in [6], [10]. Obtained trees are then merged into a well-supported and representative supertree.

III. OBTAINED RESULTS
A. Phylogenetic investigations
The approach detailed in the previous section has led to a well-supported phylogenetic tree of all the available chloroplasts, with the ordered list of genes at each leaf of the tree. An overview of the latter is provided in Figure 1. The obtained tree, also available on our website, is in general coherent with the NCBI taxonomy, except in some specific locations.

Going into the details of the obtained tree, it is well known that the first plant endosymbiosis resulted in a great diversification of lineages comprising Red Algae, Green Algae, and Land Plants (terrestrial). The interesting point in our results is that the organisms resulting from the first endosymbiosis are distributed in each of the lineages found in the chloroplast genome structure evolution, as outlined in Figure 1.

More precisely, all Red Algae chloroplasts are grouped together in one lineage, while Green Algae and Land Plant chloroplasts are all in a second lineage. Furthermore, organisms resulting from the secondary endosymbioses, as listed in Table IV, are well localized in the tree: the chloroplasts of both Brown Algae and Dinoflagellates representatives are found exclusively in the lineage also comprising the Red Algae chloroplasts from which they evolved, while the Euglenids are related to the Green Algae from which they evolved. This latter result makes sense regarding biology, history of lineages, and theories of chloroplast origins (and so of photosynthetic ability) in different eukaryotic lineages [11].
B. Gene content
Let us now investigate gene content along the tree. Indeed, genes are rearranged in the genome by evolutionary events like insertion, deletion, transposition, and inversion, which are called genome rearrangements [12]. Such rearrangements can be studied, given that we have both the gene contents and the phylogeny. A general overview of the obtained results, in terms of gene content (pan genome) evolution at the top taxonomy level, is provided in Table III, and it is detailed for the following taxonomic level in Table IV.

The core genome consists of 36 coding sequences, namely: ATPA, ATPB, ATPH, ATPI, PETB, PETG, PSAA, PSAB, PSAC, PSAJ, PSBA, PSBC, PSBD, PSBE, PSBF, PSBH, PSBI, PSBJ, PSBL, PSBN, PSBT, PSI_PSBT, RBCL, RPL14, RPL16, RPL2, RPL20, RPL36, RPS11, RPS12, RPS12_3END, RPS14, RPS19, RPS2, RPS7, and RRN16. The pan genome of all the considered species, for its part, contains 268 genes. Note that, according to our computation, no gene was specific to a given clade (that is, present in only one clade).
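The check for clade-specific genes mentioned above (genes present in only one clade) can be sketched as follows. This is an illustration of the criterion only, not the computation actually run in the study: the function name `clade_specific_genes`, the clade labels, and the toy gene sets (including the gene name YCF1 in the second example) are hypothetical, and each clade is assumed to be summarized by the set of genes of its pan genome.

```python
from collections import Counter


def clade_specific_genes(clade_pan):
    """Map each clade to the genes found in that clade and in no other one.

    clade_pan: dict mapping a clade name to the set of genes of its pan genome.
    """
    # Each clade contributes at most once per gene, since its genes form a set.
    clade_count = Counter(g for genes in clade_pan.values() for g in genes)
    return {
        clade: {g for g in genes if clade_count[g] == 1}
        for clade, genes in clade_pan.items()
    }


# Hypothetical toy data in which no gene is specific to a clade.
shared_only = {
    "clade_A": {"PSBA", "RBCL", "ACCA"},
    "clade_B": {"PSBA", "RBCL"},
    "clade_C": {"PSBA", "RBCL", "ACCA"},
}
# Same data with one extra gene placed in clade_B only.
with_specific = dict(shared_only, clade_B={"PSBA", "RBCL", "YCF1"})

no_specific = clade_specific_genes(shared_only)
one_specific = clade_specific_genes(with_specific)
print(no_specific, one_specific)
```

The first toy input mirrors the situation reported in the text, where every clade ends up with an empty set of specific genes.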
C. Relations between gene content and phylogeny
We have then further investigated the distribution of the number of genes according to the group of species. Obtained results are reproduced in Figures 2 and 3. Four groups, which are taxonomically coherent, have appeared among the 845 genomes. As shown in Fig. 3, the cluster of largest genomes has a number of genes ranging from 229 to 271, while in the group of smallest genomes, the lowest number of genes is for the Viridiplantae case. In particular, among the genomes having fewer than 120 genes, we found accession number NC_012903.1 (Eukaryota, Stramenopiles, Pelagophyceae, Pelagomonadales, Aureoumbra lagunensis), and 63 Spermatophyta species: 3 Pinidae, 58 Magnoliophyta, one Cycadidae, and finally one Gnetidae. Overall, we obtain chloroplast genomes varying from 73 to 271 genes.

We can further note that (1) most of the organisms in the green lineage (green algae and land plants) have a lower number of genes in their chloroplasts compared to the red algae; (2) most land plants have genome sizes ranging between 120 and 160 kb [13]; and (3) most of the differences in genome size are due to the number of paralogous genes.

Looking more closely at the ordered list of genes to investigate the reasons for such differences in size, it appears that the gene content evolution can mostly be explained by repetitions of some genes and the loss of other ones: no large-scale recombination is responsible for such variations. The usual case is as in Figure 4 for the ACCA pan gene, in which single vulnerable genes are lost, possibly in various independent branches, due to deleterious mutations. Such results have been obtained by comparing, for each pair of close genomes, all gene names and positions, through manual inspection assisted by homemade scripts. Some mutation and indel events are also provided in Table V, for the sake of illustration.

TABLE IV: Taxonomy in the second level

Taxonomy        Second level        Nb. genomes   Min nb. of pan genes   Max nb. of pan genes   Avg nb. of pan genes
Alveolata       Chromerida
                Dinophyceae
Cryptophyta     Pyrenomonadales
Euglenozoa      Euglenida
Haptophyceae    Phaeocystales
                Isochrysidales
                Pavlovales
Rhodophyta      Bangiophyceae
                Florideophyceae
Stramenopiles   Px_clade
                Bacillariophyta     20            138                    271                    231.35
                Eustigmatophyceae
                Raphidophyceae
                Pelagophyceae
Viridiplantae   Chlorophyta         58            156                    271                    244.517
                Streptophyta        717           85                     271                    228.638

Fig. 3: Classification of chloroplast genomes according to numbers of pan genes.

IV. CONCLUSION
In this article, we made significant progress in the study of chloroplastic sequence evolution, by providing material and methods required in the quest for the ancestral genome of the chloroplasts. A large set of complete chloroplast genomes has been studied de novo regarding both core and pan genomes, phylogenetic relationships, and gene content modifications. We then started to study the produced data, by emphasizing some remarkable relations between well-known events of the chloroplast history and the evolution of gene content over the phylogenetic tree.

In future work, our intention is to investigate more systematically such relations between remarkable ancestral nodes in the tree, endosymbiosis events, and the evolution of gene content. We will examine whether some branches of the trees are statistically remarkable when considering gene content (for instance, whether there is a correlation between the presence or absence of a subset of genes and a particular taxonomy). Then, the gene ordering and content of each ancestral node will be computed using ad hoc algorithms, ancestral DNA sequences will be inferred, and ancestral intergenic regions will be deduced, in order to obtain all ancestral genomes with confidence indications like probabilities. The produced ancestral genomes will then be used to investigate hypotheses formulated by biologists regarding the origin of chloroplasts, their recombination events, and the transfer of some material to the nucleus. We will in particular study whether recombination events were uniform over time and over the whole sequence, or whether it is possible to highlight some hot spots of recombination in the history of these chloroplasts.
All computations have been performed using the “Mésocentre de calcul de l’Université de Franche-Comté” supercomputer facilities.

REFERENCES
[1] J. Lewis, M. Raff, K. Roberts, B. Alberts, A. Johnson, John H. Wilson, P. Walter, and Tim Hunt. Molecular biology of the cell. Biochemistry and Molecular Biology Education, 31(3):212–213, 2003.
[2] Stacia K. Wyman, Robert K. Jansen, and Jeffrey L. Boore. Automatic annotation of organellar genomes with DOGMA. Bioinformatics, Oxford Press, 20(17):3252–3255, 2004.
[3] Bassam Alkindy, Jean-François Couchot, Christophe Guyeux, Arnaud Mouly, Michel Salomon, and Jacques M. Bahi. Finding the core-genes of chloroplasts. Journal of Bioscience, Biochemistry, and Bioinformatics, 4(5):357–364, 2014.
TABLE V: Example of comparison between pairwise genomes from various species, to investigate the changes that occurred within branches of the tree.

Index         Clade           Sub-kingdom      Order/Family      Genome name    Nb. of pan genes   Deletion/Insertion   Matching ratio
Camelina      Viridiplantae   Embryophyta      Camelineae        Barbarea       267                173/0                48.46
Aquilaria     Viridiplantae   Embryophyta      Camelineae        Hibiscus       267                164/0                28.26
Acer          Viridiplantae   Embryophyta      Sapindales        Azadirachta    267                173/0                49.02
Lepidozamia   Viridiplantae   Embryophyta      Zamiaceae         Zamia          267                172/0                51.66
Lavanduleae   Viridiplantae   Embryophyta      Lamiaceae         Perman         267                173/0                49.91
Aureoumbra    Stramenopiles   Pelagophyceae    Pelagomonadales   Aureococcus    267                193/0                27.13
Epimedium     Viridiplantae   Eudicotyledons   Berberidoideae    Berberis       267                180/0                33.52
NC_026690_1   Viridiplantae   Eudicotyledons   Actinidia         NC_026691_1    271                174/0                48.63
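A pairwise comparison of the kind summarized in Table V can be sketched from two ordered lists of gene names. The source does not define how the deletion/insertion counts and the matching ratio are computed, so everything below is an assumption made for illustration: deletions are counted as gene copies of the first genome missing from the second one, insertions as the converse, and the matching ratio is taken to be the similarity ratio of Python's `difflib.SequenceMatcher` expressed as a percentage. The function name `compare_gene_lists` and the toy gene lists are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def compare_gene_lists(genes_a, genes_b):
    """Return (deleted, inserted, matching_ratio) between two ordered gene lists.

    Assumed definitions (not taken from the source):
    - deleted: gene copies present in genes_a but missing from genes_b;
    - inserted: gene copies present in genes_b but missing from genes_a;
    - matching_ratio: order-aware similarity, as a percentage.
    """
    count_a, count_b = Counter(genes_a), Counter(genes_b)
    deleted = sum((count_a - count_b).values())
    inserted = sum((count_b - count_a).values())
    matcher = SequenceMatcher(None, genes_a, genes_b, autojunk=False)
    return deleted, inserted, round(100.0 * matcher.ratio(), 2)


# Hypothetical toy genomes: the second one has lost ACCA.
genome_a = ["PSBA", "RBCL", "ACCA", "ATPA", "ATPB"]
genome_b = ["PSBA", "RBCL", "ATPA", "ATPB"]
deleted, inserted, ratio = compare_gene_lists(genome_a, genome_b)
print(deleted, inserted, ratio)
```

Counting with multisets keeps track of duplicated genes, while the order-aware ratio also reacts to gene positions, in line with the comparison of both gene names and positions described in the text.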
Fig. 4: ACCA gene loss in various branches of the tree

[4] Bassam AlKindy, Christophe Guyeux, Jean-François Couchot, Michel Salomon, and Jacques M. Bahi. Gene similarity-based approaches for determining core-genes of chloroplasts. In Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, pages 71–74. IEEE, 2014.
[5] Bassam AlKindy, Huda Al-Nayyef, Christophe Guyeux, Jean-François Couchot, Michel Salomon, and Jacques M. Bahi. Improved core genes prediction for constructing well-supported phylogenetic trees in large sets of plant species. In Francisco Ortuño and Ignacio Rojas, editors, Bioinformatics and Biomedical Engineering, volume 9043 of Lecture Notes in Computer Science, pages 379–390. Springer International Publishing, 2015.
[6] Bassam AlKindy, Christophe Guyeux, Jean-François Couchot, Michel Salomon, Christian Parisod, and Jacques M. Bahi. Hybrid genetic algorithm and lasso test approach for inferring well supported phylogenetic trees based on subsets of chloroplastic core genes. CoRR, abs/1504.05095, 2015.
[7] Robert C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.
[8] Alexandros Stamatakis. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, page btu033, 2014.
[9] Jeffrey Rizzo and Eric C. Rouchka. Review of phylogenetic tree construction. University of Louisville Bioinformatics Laboratory Technical Report Series, pages 2–7, 2007.
[10] Reem Alsrraj, Bassam AlKindy, Christophe Guyeux, Laurent Philippe, and Jean-François Couchot. Well-supported phylogenies using largest subsets of core-genes by discrete particle swarm optimization. Proceedings of CIBB, 2:1, 2015.
[11] Xi Li, Ti-Cao Zhang, Qin Qiao, Zhumei Ren, Jiayuan Zhao, Takahiro Yonezawa, Masami Hasegawa, M. James C. Crabbe, Jianqiang Li, and Yang Zhong. Complete chloroplast genome sequence of holoparasite Cistanche deserticola (Orobanchaceae) reveals gene loss and horizontal gene transfer from its host Haloxylon ammodendron (Chenopodiaceae). PLoS ONE, 8(3):e58747, 2013.
[12] Mikita Suyama and Peer Bork. Evolution of prokaryotic gene order: genome rearrangements in closely related species. Trends in Genetics, 17(1):10–13, 2001.
[13] Beverley R. Green. Chloroplast genomes of photosynthetic eukaryotes.