DrosOCB: a high resolution map of conserved non coding sequences in Drosophila
aa r X i v : . [ q - b i o . GN ] O c t DFTT 4/07
DrosOCB: a high resolution map of conserved noncoding sequences in Drosophila
Loredana Martignetti , , Michele Caselle , , Bernard Jacq and Carl Herrmann
1. Dipartimento di Fisica Teorica dell’Universit`a di Torino and I.N.F.N.,Via Pietro Giuria 1, I-10125 Torino, Italy2. Centre on Complex Systems in Biology and Molecular Medicine,Via Accademia Albertina 13, I-10125 Torino, Italye–mail: (caselle)(martigne)@to.infn.it3. Institut de Biologie du Dveloppement de Marseille-Luminy,UMR6216 CNRS, Case 907, Parc Scientifique et Technologique de Luminy,13288 Marseille, Cedex 9, Francee–mail: (jacq)(herrmann)@ibdm.univ-mrs.fr
Abstract
Comparative genomics methods are widely used to aid the func-tional annotation of non coding DNA regions. However, aligning noncoding sequences requires new algorithms and strategies, in order totake into account extensive rearrangements and turnover during evo-lution. Here we present a novel large scale alignment strategy whichaims at drawing a precise map of conserved non coding regions be-tween genomes, even when these regions have undergone small scalerearrangments events and a certain degree of sequence variability.We applied our alignment approach to obtain a genome–wide cat-alogue of conserved non coding blocks (CNBs) between Drosophilamelanogaster and 11 other Drosophila species. Interestingly, we ob-serve numerous small scale rearrangement events, such as local inver-sions, duplications and translocations, which are not observable in thewhole genome alignments currently available.The high rate of observed low scale reshuffling show that this databaseof CNBs can constitute the starting point for several investigations, re-lated to the evolution of regulatory DNA in Drosophila and the in silicoidentification of unannotated functional elements.
Background
The functional annotation of eukaryotic DNA sequences represents a greatchallenge in post–genomic biological research. The identification of func-tional non–coding elements, such as untranslated regions (UTRs), genes fornon–protein-coding RNAs, and cis-regulatory elements, is extremely diffi-cult, as the rules governing their structure and function are far from beingwell undertood.A great aid to functional annotation of genome sequences is provided bycomparative genomics methods which, since a few years, have been extendedalso to non coding DNA regions. The basic assumption of comparativegenomic approach is that common features of two organisms are encodedwithin the DNA that is conserved between the species, due to purifyingselection during evolution. According to the same assumption, the DNAsequences controlling the expression of genes that are regulated similarly intwo related species should also be selected during evolution.However, comparison of non coding sequences requires new algorithmsand strategies to take into account the different evolutionary mechanismsaffecting regulatory sequences. Recent studies examining the evolution ofcis–regulatory modules in Drosophila, reveals that regulatory sequences mayfrequently evolve through compensatory gain and loss events in transcrip-tion factors binding sites, that produces little functional change [1], [2].Great plasticity in the arrangement of binding sites within cis–regulatorymodules is another remarkable evolutionary feature revealed to occur invertebrates [3].Once complete genomes from different species are available, a globalalignment procedure is suitable to find a map of colinear conserved segmentsbetween the input sequences, descarding alignments that overlap or crossover. Global alignment methods are widely used to identify highly similarregions in the sequences which appear in the same order and orientation.On the contrary, local alignment algorithms are generally very useful infinding similarity between regions that may be related but are inverted orrearranged with respect to each other.Recently, the novel notion of glocal alignment, a sophisticated combi-nation of global and local methods, has been introduced [5]. This class ofalignment algorithms create a map that transforms one sequence into theother while allowing for rearrangement events. This procedure, at the base1f Shuffled-LAGAN algorithm [6], is able to take into account large scalegenomic rearrangments, but fails at lower scale.Here, we present an novel large scale alignment strategy which aims atdrawing a precise map of conserved non-coding regions between genomes,even when these regions have undergone small scale rearrangement events.Our procedure is optimized to take into account the great plasticity of noncoding DNA, such as shuffling and sequence variability of binding siteswithin functional modules, low scale translocations, inversions and dupli-cations. We used a “gene-centric” approch, in that it starts with a list oforthologous genes between two species, and applies a local alignment algo-rithm to the corresponding flanking intergenic regions and intronic regionsof these orthologous pairs. Hence, it is a local alignment strategy but ap-plied systematically on a genome-wide scale and, for this reason, we decidedto call it “lobal”.The recent availability of 12 Drosophila species sequences and annota-tions [7] offers a complete and reliable genomic dataset for developing andtesting methods for comparative genomics of non coding DNA. We appliedour lobal alignment approach to align Drosophila melanogaster to severalother drosophila species (D. yakuba, D. pseudoobscura, D. virilis, ...), forwhich a reliable genome build and annotation is available.
Gene–centric comparative approach
For each Drosophila species examined (listed in Tab.1 and referenced to asD.xxx), we compile a list of genes orthologous to a D.melanogaster (D.mel)gene, according to the “12 drosophila genomes project” data (Tab.1 andMaterial and Methods). For each pair of D.mel/D.xxx orthologous genes,we extract in both species the upstream, downstream and intronic regions.Upstream and downstream regions are extracted up to the next neighbor-ing gene (see Material and Methods for more details), taking the longesttranscript as a reference in case of multiple transcripts. All sequences havebeen previously masked for repeats using the RepeatMasker program [8]. Atthis stage, the comparison procedure crucially depends on the availabilityof genomic annotations (i.e. gene coordinates and orthology relationships).The orthologous regions are then aligned using a local alignment proceduredescribed later. For the alignment, the orthologous regions are orientedsuch that the corresponding genes are in the same orientation. Using this2enecentric approach, most intergenic regions are considered twice. For ex-ample, the region chr4:64404-68333 in D.melanogaster is first considered asthe upstream region of the
PlexB gene, and then as the downstream regionof the ci gene. This redundancy is taken care of in the post-processing step,described later. Alignment procedure
For each pair of orthologous D.mel/D.xxx genes, we respectively align theirupstream regions, downstream regions and introns. This is done by ori-enting the transcripts in the same direction, such as to distinguish samefrom opposite strand. Local pairwise alignments between orthologous se-quences was performed using CHAOS [5], which is an heuristic alignmentalgorithm with some peculiar features optimized for large non coding DNAsequences. CHAOS works by chaining small words (called seeds) that matchbetween the two input sequences. Unlike BLAST, it is a double seed tech-nique which allows some degeneracy in seeds. It chains toghether seeds thatare closer than a maximum distance d and it returns the highest scoringchains, according to a standard NeedlemanWunsch metric. These highestscoring chains constitue the conserved noncoding blocks (CNBs). Becauseit is a local alignment, it is able to identify nonsyntenic CNBs order with avery high resolution. Moreover, because the alignment is performed on bothstrands, we also identify CNBs resulting from inversion events. Also, it isable to rapidly align large sequences with a better specificity than purely lo-cal aligners, thanks to the double seed technique. We choose a quite sensitiveset of parameters in CHAOS (see Material and Methods). An assessmentof statistical significance of alignment scores is introduced to discriminatetrue from random alignments. The scoring cutoff is calculated by aligningrandomly selected non-orthologous sequences, and setting a false discoveryrate (FDR) of 2 · − . Matherials and methods
Sequences and annotations have been downloaded from the AAA site [7] asfasta and GFF3 files respectively. The Drosophila melanogaster sequencesand annotations correspond to version 4.3. For the other drosophila, thesequences correspond to the CAF1 assemblies and are now available fromGenBank. The annotations result from a reconciliation procedure of variousannotations, whereas the homology maps are built using a fuzzy reciprocalblast. For details, see [7]. 3e rely on the gene annotations to extract the D.melanogaster introns.When several transcripts exist for a single gene, we consider the longesttranscript and its introns. For other drosophila species, no annotations ex-ist for intron/exon structure. Hence, we extract the locus corresponding tothe full gene, and align it using CHAOS to each intron of the orthologousgene. For intergenic regions, we applied a conservative definition. We definethe upstream region as the longest consecutive sequence of nonexonic, non-intronic nucleotides on the 5’ end of the longest transcript, and similarly forthe downstream region. While this is an intuitive definition in general, ithas particular implications in the case of nested genes. For a gene A nestedinside the intron of a gene B, the intergenic regions associated to gene Awill start at the 5’/3’ extremities of gene B, in order to respect the previousdefinition.We use CHAOS with the following set of parameters: chaos -wl 7 -co 12-b -v -rsc 1500. The last parameter is a very loose lower threshold on thealignment score, but we apply more stringent thresholds in the postprocess-ing step.
Post–processing and availability
As mentioned previously, sequences are often considered and aligned twice,resulting in redundant CNBs. We eliminate this redundancy by scanningthe output of the alignments, and merging overlapping CNBs. More pre-cisely, we merge two CNBs if they meet all of the following requirements: (i)they overlap in D.melanogaster, (ii) they overlap in the other species, (iii)both blocks are in the same orientation in D.melanogaster, and in the sameorientation in the other Drosophila species. Conditions (i) and/or (ii) are forexample not fulfilled in the case of duplications; in this case, the CNBs arenot merged and appear as distinct blocks. Each block is assigned a uniqueidentifier and is labelled with its score, percentage identity, as well as withthe name of the gene(s) in the surrounding of which it is located. For thereasons mentioned previously, a block often refers to its two flanking genes.The full collection of CNBs for all eleven pairwise comparisons is avail-able as a queryable database, named DrosOCB (for Drosophila ConservedBlocks). It can be accessed through a userinterface which allows to query aparticular gene or a genomic region. Our database is linked with the UCSCgenome browser [9], such that CNBs can be displayed in their genomic con-4ext with the browser.
DrosOCB database content
In Table 1, the content of the database is summarized for each species com-pared with D.melanogaster. Their phylogenetic relationship is shown in Fig.1. The cumulated size of the D.melanogaster sequences (intronic and inter-genic) which are aligned varies in the range between 78.6 Mbp and 87.7 Mbp,depending on the total number of orthologous genes between D.melanogasterand the other species. Considering that the D.melanogaster genome size isaround 120 Mbp, this means that we aligned between 65% and 73% ofthe D.melanogaster genome. Analyzing the catalog of CNBs, we can makesome observations about the conservation features of Drosophila genus atlarge scale. The estimated percentage of non coding sequences evolutionaryconstrained in Drosophila genome is reported in Tab. 1 and displayed in Fig.2. As expected, the percentage of conservation follows the evolutionary dis-tance. It varies between 16% (13%) for D.melanogaster/D.virilis intergenic(intronic) sequences, the most evolutionary distant species in the phyloge-netic tree, and 68% (54%) for D.melanogaster/D.sechellia. These estima-tions are lower than the ones obtained for D.melanogaster compared withDrosophila D.virilis, D.pseudoobscura and D.yakuba from previous work[911]. However, we applied a rather conservative threshold on the scoresof the CNBs, such as to reduce the number of spurious alignments. Theseconservation percentages are always higher in intergenic regions as com-pared to intronic regions. However, these figures should be taken with somecare, as some regions, labelled as “intergenic” in some drosophila species(and thus not aligned as introns) might well turn out to be intronic, as dis-tant exons will become better annotated. In fact, whereas the mean size ofgenes in D.melanogaster is 6.1 kb, it ranges from 2.9 kb (D.sechellia) to 4.1kb (D.virilis) for the other species, indicating that some gene annotationsmight still miss distant exons.Interestingly, these proportions are roughly constant inside the melanogastersubgroup (around 50%), indicating that the difference in the evolutionarydistance between D.melanogaster and D.simulans/D.sechellia on one hand(about 5 My), and D.yakuba/D.erecta (about 10 My) on the other hand istoo small to affect the conservation of noncoding DNA. There is a impor-tant decrease outside this branch (roughly 25% for D.ananassae, the closestspecies outside the melanogaster subgroup). The percentages for speciesoutside the Sophophora subgenus (D.virilis, D.mojavensis and D.grimshawi)5re again very comparable (about 15%). The mean size of CNBs obtainedin our output is comprized between 50 bp and 94 bp, and increases withdecreasing evolutionary distance, as expected (cf. column 6 in Tab.1). It isshorter for intronic regions than for intergenic regions. The lower thresholdof 50 bp indicates that, although the alignment procedure is sensitive inallowing some degeneracy in the compared sequences, it preserves a certaindegree of selectivity, discarding very short isolated CNBs with a score belowthe cutoff threshold.Due to the fact that we use a sensitive local alignment procedure, we areable to spot small scale genomic rearrangements that are not visible in stan-dard alignments (see Fig. 3). As an illustration, we will focus on a particularfeature, namely inverted CNBs. By this, we mean CNBs that lie on oppositestrands in the orthologous regions, a situation which might result from localgenomic inversions in one of the two species. Since the drosophila genomesare known to have an extreme plasticity at large/medium scales [12], it isinteresting to verify whether this is also true in or below the kb range. Fig.4 plots the percentage of CNBs that are inverted, for all eleven pairwisealignments, for intronic and intergenic regions. Depending on the evolution-ary distance, this percentage ranges from 15% to almost 30%. Interestingly,these percentages are very comparable for intergenic and intronic regions,indicating that the evolutionary dynamics is similar for these regions [10].In the custom track provided for UCSC genome browser, we use a particularcolor coding to distinguish between CNBs on the same strand (grey boxes)and the inverted CNBs (red boxes). Figure 5 shows an interesting exampleof such an event, in one of the introns of the white gene on chromosome X.The central region is highly conserved in all pairwise alignments, but theCNBs are inverted in the pairwise alignments with D.mojavensis, D.virilisand D.grimshawi. This is not due to an inversion of the full gene locus, sincethe transcripts are all taken in the same orientation when performing thealignment. The inverted CNBs have a very high score and a conservation of90% over 66 bp, excluding that they might be spurious alignments. Hence,one can speculate that there exists a local inversion inside this intron whichappeared in the common branch of these three Drosophila species. Notethat we applied a high threshold on the score of the DrosOCB blocks (2200,
F DR = 0 . · − ) such as to reduce the number of displayed blocks. In-terestingly, the ORegAnno track [13] in the lower part of the UCSC windowindicates the presence of a regulatory element, more precisely an enhancer,underlying that even functional elements are subject to extensive rearrange-ments, as previously noted [1–3]. 6 Conclusions
In this paper, we have described a new, local but genomewide alignmentprocedure for binary comparisons of Drosophila melanogaster with elevencurrently available drosophila genomes. We have shown that the resultingcollection of CNBs, organized in the DrosOCB database, constitutes a high-resolution collection of noncoding DNA conservation in drosophila. Thesmall size of the blocks, and the local nature of the alignment highlightssmall scale genomic rearragement events, that are not apparent from otherapproaches. As a preliminary study, we have focused on inverted CNBswhich might correspond to small scale inversions. A more detailed analy-sis of this phenomenon and other rearragements is left for a future morecomplete investigation. An interesting aspect of our preliminary analysis isthat the localization of these inverted blocks is not evenly distributed amongchromosomes. Chromosome X seems to have a much higher than averageproportion of these inverted blocks, indicating that it has undergone moreextensive rearragements than the autosomal chromosomes, as noted pre-viously [14]. The alignment procedure described in this work provides anoptimal tool for a high resolution comparison of non coding DNA sequences.The content of the database, and the observed high rate of low scale reshuf-fling suggest that this database of CNBs can constitute the starting pointfor several investigations, related to the evolution of regulatory DNA inDrosophila, the in silico identification of unannotated functional elementsand the search for transcription factor binding sites.
Acknowledgements
This work was partially supported by FIRB grant RBNE03B8KK from theItalian Ministry for Education, University and Research. The authors wouldlike to thank Pierre Mouren and Davide Cor`a for useful suggestions andtechnical support.
References [1] M.Z. Ludwig, A. Palsson, E. Alekseeva, C.M. Bergman, J. Nathan,M. Kreitman, Functional evolution of a cis–regulatory module.
PLoSBiology , 3(4):e93, 2005[2] A.M. Moses, D.A. Pollard, D.A. Nix, V.N. Iyer, X.Y. Li, M.D. Big-gin, M.B. Eisen, Large-scale turnover of functional transcription factor7inding sites in Drosophila.
PLoS Computational Biology , 2(10):e130,2006[3] R. Sanges, E. Kalmar, P. Claudiani, M. D’Amato, F. Muller, E. Stupka,Shuffling of cis-regulatory elements is a pervasive feature of the verte-brate lineage.
Genome Biology , 7(7):R56., 2006[4] E. Emberly, N. Rajewsky, E.D. Siggia Conservation of regulatory ele-ments between two species of Drosophila.
BMC Bioinformatics , 20;4:57,2003[5] M. Brudno, M. Chapman, B. Gottgens, S. Batzoglou, B. Morgenstern,Fast and sensitive multiple alignment of large genomic sequences.
BMCBioinformatics , 4:66, 2003[6] M. Brudno, S. Malde, A. Poliakov, C.B. Do, O. Couronne, I. Dubchak,S. Batzoglou, Glocal alignment: finding rearrangements during align-ment.
Bioinformatics , 19 Suppl 1:i54-62, 2003[7] https://rana.lbl.gov/drosophila/[8] Smit, A.F.A., Origin of interspersed repeats in the human genome,Curr. Opin. Genet. Devel. 6 (6), 743-749 1996[9] Kuhn RM et al. The UCSC genome browser database: update 2007.
Nucleic Acids Res , 35(Database issue):D668-73, 2007.[10] C.M. Bergman, M. Kreitman Analysis of conserved noncoding DNAin Drosophila reveals similar constraints in intergenic and intronic se-quences.
Genome Research , 11(8):1335-45, 2001[11] Siepel A et al. Evolutionarily conserved elements in vertebrate, insect,worm, and yeast genomes.
Genome Research , 15(8):1034-50, 2005[12] Ranz J, Casals F, Ruiz A, How malleable is the eukaryotic genome?Extreme rate of chromosomal rearrangement in the genus Drosophila,Genome Research, 11:230?239 2001.[13] Montgomery S et al. ORegAnno: an open access database and curationsystem for literature-derived promoters, transcription factor bindingsites and regulatory variation. Bioinformatics, 22(5):637?640 (2006).[14] Gonzalez J, Ranz J, Ruiz A, Chromosomal elements evolve at differentrates in the drosophila genome. Genetics, 161:1137?1154 2002.8 enome Number Size of % of mean sizesize* orthologous aligned conserved of CNBs(Mbps) genes sequences (Mbps) sequences (bps) intronic intergenic intronic intergenic intronic intergenicD.sim 139,8 11540 36,9 55,4 50% 61% 103 118D.sec 168,9 12074 39,2 56,7 54% 68% 102 115D.yak 168,0 13005 45,7 62,5 52% 64% 84 94D.ere 154,9 12459 44,1 60,7 51% 65% 88 95D.ana 234,3 11530 47,7 61,7 25% 33% 57 59D.pse 154,9 11795 66,9 44,1 19% 24% 52 54D.per 191,0 10657 39,4 58,3 17% 21% 53 54D.wil 238,9 10907 52,5 79,9 15% 18% 49 49D.moj 196,6 11213 51,1 76,6 13% 15% 49 49D.vir 209,0 11346 51,6 77,2 13% 16% 49 49D.gri 203,3 11541 48,9 71,7 13% 16% 49 49Table 1: DrosOCB database content summary From the left, columns indicate the Drosophila species, the size of thegenome (defined as the total size of the genome fasta files downloaded), the number of orthologous genes betweenD.melanogaster and the second species, the size of intronic and intergenic sequences aligned, the percentage ofconserved sequences (i.e. the total nonredundant size of intronic and intergenic CNBs divided by the size ofintronic/intergenic sequences aligned), the mean size of intronic and intergenic CNBs.9