Anchor points for genome alignment based on Filtered Spaced Word Matches
Chris-Andre Leimeister, Thomas Dencker, Burkhard Morgenstern
University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
University of Göttingen, Center for Computational Sciences, Goldschmidtstr. 7, 37077 Göttingen, Germany
November 16, 2018
Abstract
Alignment of large genomic sequences is a fundamental task in computational genome analysis. Most methods for genomic alignment use high-scoring local alignments as anchor points to reduce the search space of the alignment procedure. Speed and quality of these methods therefore depend on the underlying anchor points. Herein, we propose to use Filtered Spaced Word Matches to calculate anchor points for genome alignment. To evaluate this approach, we used these anchor points in the widely used alignment pipeline Mugsy. For distantly related sequence sets, we could substantially improve the quality of alignments produced by Mugsy.
Introduction
Sequence comparison is one of the most fundamental tasks in computational biology. Here, a basic task is to align two or several DNA or protein sequences, either globally, over their entire length, or locally, by restricting the alignment to a single region of homology. Standard approaches to sequence alignment assume that the input sequences are derived from a common ancestral sequence, and that evolutionary events are limited to substitutions, insertions and deletions of single residues or small sequence segments. In this case, sequence homologies can be represented by global sequence alignments, that is by inserting gap characters into the sequences such that evolutionarily related sequence positions are arranged on top of each other. Under most scoring schemes, calculating an optimal alignment of two sequences takes time proportional to the product of their lengths and is therefore limited to rather short sequences [42, 47, 24, 37, 22].

With the rapidly increasing number of partially or fully sequenced genomes, alignment of genomic sequences has become an important field of research in bioinformatics; see [23] for a recent review and evaluation of some of the most popular approaches. Here, the first challenge is the sheer size of the input sequences, which makes it impossible to use traditional algorithms with quadratic run time. The second challenge is that related genomes often share multiple regions of local sequence homology, interrupted by non-conserved parts of the sequence where no significant similarities can be detected. This means that neither global nor local alignment methods can properly represent the homologies between whole genomes. Finally, evolutionary events such as duplications and large-scale rearrangements must be taken into account. Since it is not possible, in general, to represent homologies among genomes in one single alignment, advanced genome aligners return alignments of so-called Locally Collinear Blocks, i.e. blocks of segments of the input sequences that contain the same genes in the same relative order.

Since the late 1990s, major efforts have been made to address the problem of genome alignment, and many approaches have been published. One of the first multiple-alignment programs that was applied to genomic sequences was DIALIGN [38, 40]. This program composes multiple alignments from chains of local pairwise alignments, and it does not penalize gaps; it is therefore able to align sequences where local homologies are separated by long non-homologous segments. The program has been applied, for example, to identify small non-coding functional elements in genomic sequences [3, 12]. However, the program was initially not designed for large genomic sequences, and it is limited to sequences up to around 10 kb. Moreover, DIALIGN is not able to deal with duplications, rearrangements and homologies on inverse strands of genomes.

To align longer sequences, most programs for genomic alignment rely on some sort of anchoring [30, 39]. In a first step, they use a fast method for local alignment to identify high-scoring local homologies, so-called anchor points. Next, chains of such local alignments are calculated and, finally, sequence segments between the chained high-scoring local alignments are aligned with a slower but more sensitive alignment method. For multiple sequence sets, anchor points can be defined either between pairs of sequences or between several or all of the input sequences. A pioneering tool to find anchor points for genomic alignment was MUMmer [18]; the current version of the program [32] is considered the state-of-the-art in alignment anchoring. MUMmer uses maximal unique matches as pairwise anchor points to align genomic sequences or protein sequences. By contrast, MGA [28] is a tool for multiple alignment of genomic sequences that uses maximal exact matches between all sequences within a given sequence set. Both MUMmer and MGA use suffix trees [31] to rapidly identify pairs or blocks of identical words, one word from each of the sequences, that are then used as anchor points. Both programs are able to align entire bacterial genomes; MUMmer was also used in the A. thaliana genome project [48]. However, since the probability of homologous exact matches rapidly decreases with increasing divergence, they are most useful to compare closely related genomes, such as different strains of E. coli.

Other approaches to genome alignment are OWEN [43], AVID [7], MAVID [8], LAGAN and Multi-LAGAN [10], CHAOS/DIALIGN [9], the VISTA genome pipeline [21], TBA [5] and Mauve [15]; see [19, 4] for a review. All of these methods are based on alignment anchoring, and most of them are able to deal with duplications and genome rearrangements. Some methods for genomic alignments are based on statistical properties of the sequences [6, 15]. Other methods are based on graphs, for example on A-Bruijn graphs [45] or on cactus graphs [44]. A further development of Mauve called progressiveMauve uses palindromic spaced seeds instead of exact word matches as anchor points [16]. That is, for a given binary pattern of length ℓ representing match and don't-care positions, one searches for a set of ℓ-mers, one ℓ-mer from each of the input sequences, such that all ℓ-mers have matching nucleotides at the match positions. At the don't-care positions, mismatches are allowed. Palindromic patterns are used to cover both strands of the input sequences. Spaced seeds are used in database searching [36, 17] and alignment-free sequence comparison [33] since they have been shown to lead to better results than contiguous word matches.
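To make the roles of match positions, don't-care positions and palindromic patterns concrete, the following minimal Python sketch extracts the spaced word at a given position and compares spaced words across sequences. It is an illustration only, not the implementation used by progressiveMauve; the pattern `1101011`, the toy sequences and all function names are assumptions chosen for this example.

```python
# Illustrative sketch only; names and the example pattern are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string; wildcard '*' is left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def spaced_word(seq, start, pattern):
    """Spaced word at 0-based `start`: symbols at '1' positions, '*' at '0' positions."""
    window = seq[start:start + len(pattern)]
    if len(window) < len(pattern):
        raise ValueError("pattern does not fit at this position")
    return "".join(c if bit == "1" else "*" for c, bit in zip(window, pattern))


def is_palindromic(pattern):
    """A binary pattern is palindromic if it reads the same in both directions."""
    return pattern == pattern[::-1]


def seeds_match(words):
    """True if all spaced words (one per input sequence) are identical."""
    return len(set(words)) == 1
```

With a palindromic pattern, the spaced word read on the reverse strand is the reverse complement of the spaced word on the forward strand, so one pattern serves both strands. For a non-palindromic pattern this symmetry does not hold in general, and the reversed pattern would have to be handled separately.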
Mugsy [2] is a popular software pipeline for multiple whole-genome alignment. In a first step, this program uses Nucmer [32] to construct all pairwise alignments of the input sequences. Nucmer, in turn, uses MUMmer to find exact unique word matches which are used as alignment anchor points. An alignment graph is constructed from these pairwise alignments using the SeqAn software [20], and Locally Collinear Blocks are constructed. Finally, a multiple alignment is calculated using SeqAn::TCoffee [46]. Mugsy has been designed to align closely related genomes, such as different strains of a bacterium. Here, it produces alignments of high quality. On more distantly related genomes, however, the program is often outperformed by other multiple genome aligners [23].

Finding anchor points is the most important step in whole-genome sequence alignment. Here, a trade-off between speed, sensitivity and precision is necessary. A sufficient number of anchor points is required in order to reduce the search space and thereby the run time for the subsequent, more sensitive alignment routine. Wrongly chosen anchor points, on the other hand, can substantially deteriorate the quality of the final output alignment. If spurious similarities are used as anchor points, this not only results in non-homologous parts of the sequences being aligned; wrong anchor points may also prevent the program from aligning biologically relevant, true homologies, since aligning them may be incompatible with the selected anchors. Also, if the number of anchor points is too large, finding optimal chains of anchor points can become computationally expensive.

In this paper, we propose a novel algorithm to find pairwise anchor points for genomic alignments that is based on the Filtered Spaced Word Matches (FSWM) idea that we previously introduced [34]. Anchor points are calculated using a hit-and-extend approach where high-scoring spaced-word matches are used as seeds: for an underlying binary pattern of length ℓ representing match and don't-care positions, we rapidly identify spaced-word matches, i.e. length-ℓ segment pairs from the input sequences with matching nucleotides at the match positions but with possible mismatches at the don't-care positions. For each spaced-word match, we then calculate a similarity score considering all aligned positions, including the don't-care positions, and we keep only those spaced-word matches that have a score above a certain threshold. These segment pairs are then extended to locally maximal gap-free alignments, similar to BLAST [1]. To evaluate our anchoring approach, we ran the Mugsy pipeline with our software in the initial step to find anchor points. For closely related input sequences, the quality of the resulting alignments is comparable to the original version of Mugsy where exact word matches are used for anchoring. Our approach is far superior, however, if distantly related sequences are to be aligned, where most other alignment approaches either fail to produce alignments or require an unacceptable amount of time.

Through our web site, we provide the adapted Mugsy pipeline with our anchoring approach as a pipeline for genome-sequence alignment that can be readily installed. A standalone version of our spaced-words software is provided as well, such that developers can integrate it into their own sequence-analysis pipelines.
Filtered Spaced Word Matches
For a sequence S of length L over an alphabet Σ and 0 < i ≤ L, S[i] denotes the i-th symbol of S. For integers w ≤ ℓ, a binary pattern P of length ℓ and weight w is a word over {0, 1} of length ℓ such that there are exactly w indices i with P[i] = 1. These positions are called match positions, while positions i with P[i] = 0 are called don't-care positions. A spaced word with respect to a pattern P is a word w over Σ ∪ {∗} where '∗' is a wildcard character not contained in Σ, and w[k] = ∗ holds if and only if k is a don't-care position, i.e. if P[k] = 0; see also [33, 29]. A spaced word w with respect to a pattern P occurs in a sequence S at position i if S[i + k − 1] = w[k] for all match positions k of the pattern P.

For sequences S_1 and S_2 with lengths L_1 and L_2, respectively, a pattern P of length ℓ, and 1 ≤ i ≤ L_1 − ℓ + 1, 1 ≤ j ≤ L_2 − ℓ + 1, we say that there is a spaced-word match between S_1 and S_2 at (i, j) with respect to P if the spaced words at i in S_1 and at j in S_2 are identical; in other words, if for all match positions k in P, one has S_1[i + k − 1] = S_2[j + k − 1]. Below is a spaced-word match between two DNA sequences S_1 and S_2 at (5, 2) with respect to the pattern P = 1100101:

    S_1: G C T G T A T A C G T C
    S_2:       S T A C A C T T A T
    P  :         1 1 0 0 1 0 1

Indeed, the spaced word 'T A ∗ ∗ C ∗ T' occurs at position 5 in S_1 and at position 2 in S_2.

Herein, we propose to use spaced-word matches as a first step to calculate anchor points for pairwise alignment. We therefore need some criterion to distinguish between spaced-word matches representing true homologies and random background matches. In a previous paper, we used spaced-word matches to estimate phylogenetic distances between genomic sequences [34]. To this end, we first identified all spaced-word matches with respect to a given pattern P. To remove spurious random spaced-word matches, we applied a simple filtering procedure: using a nucleotide substitution matrix [13], we calculated for each spaced-word match the sum of the scores of all aligned pairs of nucleotides (including match and don't-care positions), and we removed all spaced-word matches with a score below zero.

A graphical representation of the spaced-word matches between two sequences shows that this procedure can clearly separate random spaced-word matches from true homologies. If we plot for each possible score value s the number of spaced-word matches with score s, we obtain a bimodal distribution with one peak for random matches and a second peak for homologies. We call such a plot a spaced-words histogram. For simulated sequence pairs under a simple model of evolution, both peaks are normally distributed. For real-world sequences, the random peak is still normally distributed, but the 'homologous' peak is more complex, see Figure 1. Even so, using a cut-off value of zero can clearly distinguish between random matches and true homologies. More examples for spaced-words histograms are given in [34].

Figure 1: Spaced-words histogram for a comparison of two bacterial genomes, Phaeobacter gallaeciensis and Rhodobacterales bacterium Y4I. All possible spaced-word matches with respect to a given binary pattern P are identified, and their scores are calculated as explained in the main text. The number of spaced-word matches with a score s is plotted against s. Two peaks are visible, an approximately normally distributed peak for background spaced-word matches, and a more complex peak for spaced-word matches representing homologies. With a cut-off value of zero, background and homologous spaced-word matches can be reliably separated.

Our approach to find anchor points for pairwise genomic alignment is as follows. For given parameters ℓ and w, we first calculate a binary pattern P with length ℓ and weight (number of match positions) w using our recently developed software rasbhari [25]. We then identify all spaced-word matches with respect to P. To find homologies even for distantly related sequences, we use patterns with a low weight; by default, we use a weight of w = 10. On the other hand, we use a large number of don't-care positions, since this makes it easier to distinguish true homologies from random spaced-word matches.
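The identification and filtering of spaced-word matches described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FSWM implementation: it indexes spaced words in a dictionary rather than using the data structures of the actual software, positions are 0-based, all function names are hypothetical, and the substitution scores are HOXD70-style values quoted from memory that should be checked against [13].

```python
from collections import defaultdict

# ASSUMPTION: HOXD70-style substitution scores, quoted from memory for
# illustration only; verify against Chiaromonte et al. [13] before use.
_SCORES = {
    ("A", "A"): 91, ("A", "C"): -114, ("A", "G"): -31, ("A", "T"): -123,
    ("C", "C"): 100, ("C", "G"): -125, ("C", "T"): -31,
    ("G", "G"): 100, ("G", "T"): -114,
    ("T", "T"): 91,
}


def pair_score(a, b):
    """Symmetric lookup of the substitution score of two nucleotides."""
    return _SCORES[(a, b)] if (a, b) in _SCORES else _SCORES[(b, a)]


def index_spaced_words(seq, pattern):
    """Map each spaced word (symbols at match positions) to its 0-based starts."""
    match_positions = [k for k, bit in enumerate(pattern) if bit == "1"]
    index = defaultdict(list)
    for start in range(len(seq) - len(pattern) + 1):
        key = "".join(seq[start + k] for k in match_positions)
        index[key].append(start)
    return index


def spaced_word_matches(s1, s2, pattern):
    """All 0-based (i, j) where s1 and s2 agree at every match position."""
    index2 = index_spaced_words(s2, pattern)
    matches = []
    for key, starts1 in index_spaced_words(s1, pattern).items():
        for i in starts1:
            for j in index2.get(key, []):
                matches.append((i, j))
    return sorted(matches)


def match_score(s1, s2, i, j, length):
    """Sum of scores over all aligned pairs, don't-care positions included."""
    return sum(pair_score(s1[i + k], s2[j + k]) for k in range(length))


def filtered_spaced_word_matches(s1, s2, pattern, threshold=0):
    """Discard matches whose score is below the threshold (default: zero)."""
    kept = []
    for i, j in spaced_word_matches(s1, s2, pattern):
        score = match_score(s1, s2, i, j, len(pattern))
        if score >= threshold:
            kept.append((i, j, score))
    return kept
```

Applied to the two example sequences of this section as printed, with P = 1100101, the sketch finds a single spaced-word match at the 0-based positions (4, 1), which corresponds to (5, 2) in the 1-based notation used in the text. Under the assumed scores, this match has a positive score and survives the filter.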
By default, we use a pattern length of ℓ = 110, so our patterns contain 10 match positions and 100 don't-care positions; we use the following nucleotide substitution matrix described in [13]:

            A      C      G      T
      A    91   −114    −31   −123
      C          100   −125    −31
      G                 100   −114
      T                         91

We calculate the score of each spaced-word match as the sum of the substitution scores of all aligned pairs of nucleotides. We then discard all spaced-word matches with a score below zero.

Next, we extend the identified spaced-word matches in both directions without gaps. As the starting point for this extension, we do not use the full spaced-word matches, but their mid points. The reason for this is that, with our long patterns, even a high-scoring spaced-word match may not represent sequence homologies over its entire length. It often occurs that some part of a spaced-word match aligns homologous nucleotides, but another part extends into non-homologous regions of the sequences. There is a high probability, however, that the mid point of a long, high-scoring spaced-word match is located within a region of true homology. Finally, we use the resulting 'extended' gap-free alignments as anchor points for alignment.
Evaluation
To evaluate Filtered Spaced Word Matches (FSWM) and to compare it to the state-of-the-art approach to alignment anchoring, we used the Mugsy software system. As mentioned above, the original Mugsy uses MUMmer to find pairwise anchor points. We replaced MUMmer in the Mugsy pipeline by our FSWM-based anchor points and evaluated the resulting multiple alignments. In addition, we compared these alignments to alignments produced by the multiple genome aligner Cactus [44]. Cactus is known to be one of the best existing tools for multiple genome alignment; it performed excellently in the Alignathon study [23]. To measure the performance of the compared methods, we used simulated genomic sequences as well as three sets of real genomes. To make MUMmer directly comparable to FSWM, we used a minimum length of 10 nt for maximal unique matches, corresponding to the default weight (number of match positions) used in Spaced Words. Note that, by default, MUMmer uses a minimum length of 15 nt. With this default value, however, we obtained alignments of much lower quality. The Cactus tool was run with default values.
Simulated genome sequences
To simulate genomic sequences, we used the Artificial Life Framework (ALF) developed by Dalquen et al. [14]. ALF evolves gene sequences based on a probabilistic model along a randomly generated tree, starting with an ancestral gene. During this process, evolutionary events are logged such that the true MSA is known for each simulated gene family. This true MSA can then be used as reference to assess the quality of automatically generated alignments.

We generated a series of 14 data sets, each containing 30 simulated 'genomes', with increasing mutation rates for the different data sets. For all other parameters in ALF, we used the default settings. In each data set, there are 750 simulated gene families such that one gene from each gene family is present in each of the 30 simulated genomes. Thus, each of the 'genomes' contains the same set of 750 genes. We varied the mutation rates between an average of 0.1013 substitutions per position for the first data set and an average of 0.8349 substitutions per position for the 14th data set. The maximal pairwise distance between all pairs of sequences within one data set ranges from 0.1640 for the first to 1.0923 for the 14th data set. The simulated genes have an average length of about 1500 bp, summing up to a total size of about 32 MB per data set.

To assess the quality of the produced alignments, we calculated recall and precision values in the usual way. If, for one given data set, S is the set of all positions in the 30 simulated genomes, we denote by A ⊂ S² the set of all pairs of positions aligned by the alignment that is to be evaluated, while R ⊂ S² denotes the set of all pairs of positions aligned in the reference alignment. Recall and precision are then defined as

\[ \mathrm{recall} = \frac{|A \cap R|}{|R|}, \qquad \mathrm{precision} = \frac{|A \cap R|}{|A|} \tag{1} \]

The harmonic mean of recall and precision is called the balanced F-score and is often used as an overall measure of accuracy; it is thus defined as

\[ F_{\mathrm{score}} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]

To estimate these three values, we used the tool mafComparator which was also used in the Alignathon study [23]. Since it is impractical to consider the entire set S² of pairs of positions of the test sequences, we sampled 10 million pairs of positions for each data set. This corresponds to the evaluation procedure used in Alignathon.

For the simulated sequence sets, the precision and recall values are shown in Figure 2. For data sets with smaller mutation rates, alignments obtained with FSWM are only slightly better than those obtained with MUMmer. However, if the mutation rate increases, our spaced-words approach substantially outperforms the original version of Mugsy where exact word matches are used to find anchor points. Not only are more homologies detected, but the precision is also slightly higher if Filtered Spaced Word Matches are used instead of MUMmer.
Real-world genome sequences
For real-world genome families, it is usually not possible to calculate the precision of MSA programs because it is, in general, not known which sequence positions exactly are homologous to each other and which ones are not. If there are core blocks of the sequences for which the biologically correct alignment is known, at least the recall can be calculated for these core blocks. For most genome sequences, however, no such core blocks are available. To evaluate Mugsy, the authors of the program used the number of core columns of the produced alignments as a criterion for alignment quality [2]. Here, a core column is defined as a column that does not contain gaps, i.e. a column that aligns nucleotides from all of the input sequences. In addition, the authors of Mugsy used the number of pairs of aligned positions of the aligned sequences as an indicator of alignment quality. In this paper, we use the same criteria to evaluate multiple alignments of real-world genomes.

As a first real-world example, we used a set of 29 E. coli/Shigella genomes that has already been used in the original Mugsy paper, see supplementary material for details; these sequences have also been used to evaluate alignment-free methods [26, 49, 41]. The total size of this data set is about 141 MB. As a second test set, we used another prokaryotic data set which consists of 32 complete Roseobacter genomes (details in the supplementary material). This data set was used to assess the performance on more distantly related organisms than the E. coli/Shigella strains. The total size of this data set is about 135 MB. To test our approach on eukaryotic genomes, we used as a third test case a set of nine fungal genomes, namely Coprinopsis cinerea, Neurospora crassa, Aspergillus terreus, Aspergillus nidulans, Histoplasma capsulatum, Paracoccidioides brasiliensis, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Ustilago maydis (GenBank accession numbers are given in the supplementary material). The total size of this third data set is about 253 MB.

[Figure 2, two panels: recall and precision plotted against average pairwise distance for MUMs+mugsy, spacedAnchors (k=10)+mugsy, spacedAnchors (k=8)+mugsy and cactus.]
Figure 2: Recall and Precision of Mugsy with anchor points from Filtered Spaced Word Matches (FSWM) and MUMmer, respectively, and of Cactus on simulated genomic sequences generated with ALF, see main text for details. FSWM was used with the default weight w = 10, i.e. with 10 match positions in the underlying pattern. In addition, we ran FSWM with w = 8.

[Figure 3: F-score plotted against average pairwise distance for MUMs+mugsy, spacedAnchors (k=10)+mugsy, spacedAnchors (k=8)+mugsy and cactus.]
Figure 3: F-Score of Mugsy with anchor points from Filtered Spaced Word Matches and MUMmer, respectively, and of Cactus on simulated genomic sequences generated with ALF.

The results of Mugsy with MUMmer and FSWM for the three real-world data sets are shown in Table 1, together with the results obtained with Cactus. In addition to the number of core columns and the number of aligned pairs of positions, the table contains the number of core Locally Collinear Blocks, i.e. the number of Locally Collinear Blocks involving all of the input sequences, and the total number of Locally Collinear Blocks returned by the alignment programs.
Program run time
Table 2 reports the program run times of Mugsy with FSWM, Mugsy with MUMmer and Cactus on the above three real-world sequence sets. In addition, the table contains the run times for FSWM and MUMmer alone.

    E. coli/Shigella genomes
      Mugsy + MUMmer      539    1.61E+09    2,827,115     4,138
      Mugsy + FSWM        664    1.63E+09    2,867,432     5,906
      Cactus
    Roseobacter genomes
      Mugsy + MUMmer       39    3.63E+08       13,654    13,501
      Mugsy + FSWM        859    7.15E+08      824,054    30,836
      Cactus
    Fungal genomes
      Mugsy + MUMmer
      Mugsy + FSWM
      Cactus

Table 1: Alignments of 29 E. coli/Shigella genomes, 32 Roseobacter genomes and 9 fungal genomes, calculated with Mugsy using anchor points from our spaced-words approach and from MUMs, respectively, and with Cactus. The first column contains the number of core columns, i.e. the number of columns in the multiple alignment that do not contain gaps; the second column contains the total number of aligned pairs of positions in the alignment. The third column contains the number of core Locally Collinear Blocks (LCBs), i.e. the number of LCBs that involve all of the aligned genomes ('core LCBs'), while the last column contains the total number of LCBs.

                        E. coli/Shigella    Roseobacter    fungal genomes
    FSWM                              59             83               110
    FSWM + Mugsy                     638           6428              1488
    MUMmer                            73             63                43
    MUMmer + Mugsy                   286           1099                63
    Cactus                           714           1775               775

Table 2: Run time in minutes for three different multiple genome-alignment methods applied to the three test data sets that we used in our program evaluation.

Discussion
In this paper, we proposed a novel approach to calculate anchor points for genome alignment. Finding suitable anchor points is a critical step in all methods for genome alignment, since the selected anchor points determine which regions of the sequences can be aligned to each other in the final alignment. A sufficient number of anchor points is necessary to keep the search space and run time of the main alignment procedure manageable, so sensitive methods are needed to find anchor points. Wrongly selected anchor points, on the other hand, can seriously deteriorate the quality of the final alignments, so anchoring procedures must also be highly specific.

Earlier approaches to genomic alignment used exact word matches as anchor points [18, 28], since such matches can be easily found using suffix trees and related indexing structures. These approaches are limited, however, to situations where closely related genomes are to be aligned, for example different strains of a bacterium. In modern approaches to database searching, spaced seeds are used to find potential sequence homologies [35, 27, 11]. Here, binary patterns of match and don't-care positions are used, and two sequence segments of the corresponding length are considered to match if identical residues are aligned at the match positions, while mismatches are allowed at the don't-care positions. Such pattern-based approaches are more sensitive than previous methods that relied on exact word matches.

We previously proposed to apply the 'spaced-seeds' idea to alignment-free sequence comparison, by replacing contiguous words by so-called spaced words, i.e. by words that contain wildcard characters at certain pre-defined positions [33]. More recently, we introduced filtered spaced word matches [34] to estimate phylogenetic distances between genome sequences. In the latter approach, we first identify spaced-word matches using relatively long patterns with only few match positions. For the identified matching segments, we then look at all aligned pairs of nucleotides, including the ones at the don't-care positions, and we discard spaced-word matches if the overall degree of similarity between the two segments is below a threshold. Phylogenetic distances can be estimated based on the aligned nucleotides at the don't-care positions of the remaining spaced-word matches. We showed that this procedure is fast and highly sensitive, and it can reliably distinguish between true homologies and spurious sequence similarities.

In the present study, we used filtered spaced word matches to calculate high-quality anchor points for genomic sequence alignment.
Instead of using spaced-word matches directly as anchor points, we extend them in both directions, similar to the hit-and-extend approach to database searching. To evaluate these anchor points, we integrated them into the popular genome-alignment pipeline Mugsy. Test runs on simulated genome sequences show that, for closely related sequences, Mugsy produces alignments of high quality with both types of anchor points. For more distantly related sequences, however, the recall values of the program drop dramatically if anchor points are calculated with MUMmer while, with our spaced-word matches, one observes recall values close to 100% for distances up to around 0.7 substitutions per position.

For real-world genomes, it is more difficult to evaluate the performance of genome aligners since there is only limited information available on which positions are homologous to each other and which ones are not. Angiuoli and Salzberg [2] therefore used the number of aligned pairs of positions as an indicator of alignment quality, together with the size of the 'core alignment', i.e. the number of alignment columns that do not contain gaps. At first glance, these criteria might seem questionable; it would be trivial to maximize these values, simply by aligning sequences without internal gaps, adding gaps only at the ends of the shorter sequences. However, as shown in Figure 2, all MSA programs in our study have high precision values, i.e. positions aligned by these programs are likely to be true homologs. In this situation, the number of aligned position pairs and the size of the 'core alignment' can be considered as a proxy for the recall of the applied methods, i.e. the proportion of homologies that are correctly aligned.

For distantly related sequence sets, the total run time of Mugsy is much higher with our FSWM anchoring approach than with MUMmer. One reason for the increased run time with FSWM is the fact that, with spaced words, far more Locally Collinear Blocks are detected than if exact word matches are used as anchor points, especially for distantly related sequences where exact word matching is not very sensitive. One possible solution for this issue would be to apply user-defined threshold values for the total number of returned Locally Collinear Blocks or for their similarity scores, to reduce the run time of the final alignment procedure for large genomic sequences.
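For reference, the three accuracy measures used in the evaluation on simulated data (recall, precision and the balanced F-score) can be computed from two sets of aligned position pairs as in the following minimal Python sketch. It is not a substitute for mafComparator, which works on sampled pairs from alignment files; here the sets are assumed to be given explicitly, each pair is represented by an arbitrary hashable object, and the function name is hypothetical.

```python
def recall_precision_fscore(aligned, reference):
    """Return (recall, precision, F-score) for two collections of position pairs.

    `aligned` plays the role of A (pairs aligned by the evaluated alignment),
    `reference` the role of R (pairs aligned in the reference alignment).
    Ratios with an empty denominator are reported as 0.0 by convention.
    """
    a, r = set(aligned), set(reference)
    common = len(a & r)
    recall = common / len(r) if r else 0.0
    precision = common / len(a) if a else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score
```

In this toy representation a pair could, for instance, be a tuple such as `("genome_a", 17, "genome_b", 21)`, with the two positions stored in a fixed order so that the same pair is always encoded identically.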
References [1] S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Ba-sic local alignment search tool.
Journal of Molecular Biology , 215:403–410, 1990.[2] S. V. Angiuoli and S. L. Salzberg. Mugsy: fast multiple alignment ofclosely related whole genomes.
Bioinformatics , 27:334–342, 2011.143] L. M. Barton, B. G¨ottgens, M. Gering, J. G. Gilbert, D. Grafham,J. Rogers, D. Bentley, R. Patient, and A. R. Green. Regulation of thestem cell leukemia (SCL) gene: a tale of two fishes.
Proc. Natl. Acad.Sci. USA , 98:6747–6752, 2001.[4] S. Batzoglou. The many faces of sequence alignment.
Briefings inBioinformatics , 6:6–22, 2005.[5] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. A. Smit,K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green,D. Haussler, and W. Miller. Aligning multiple genomic sequences withthe threaded blockset aligner.
Genome Research , 14:708–715, 2004.[6] R. K. Bradley, A. Roberts, M. Smoot, S. Juvekar, J. Do, C. Dewey,I. Holmes, and L. Pachter. Fast statistical alignment.
PLOS ComputBiol , 5:e1000392, 2009.[7] N. Bray, I. Dubchak, and L. Pachter. AVID: A global alignment pro-gram.
Genome Research , 13:97–102, 2003.[8] N. Bray and L. Pachter. MAVID multiple alignment server.
NucleicAcids Research , 31:3525–3526, 2003.[9] M. Brudno, M. Chapman, B. G¨ottgens, S. Batzoglou, and B. Morgen-stern. Fast and sensitive multiple alignment of large genomic sequences.
BMC Bioinformatics , 4:66, 2003.[10] M. Brudno, C. Do, G. Cooper, M. Kim, E. Davydov, NISC SequencingConsortium, E. Green, A. Sidow, and S. Batzoglou. LAGAN and multi-LAGAN: Efficient tools for large-scale multiple alignment of genomicDNA.
Genome Research , 13:721–731, 2003.[11] B. Buchfink, C. Xie, and D. H. Huson. Fast and sensitive proteinalignment using DIAMOND.
Nature Methods , 12:59–60, 2015.[12] M. A. Chapman, F. J. Charchar, S. Kinston, C. P. Bird, D. Grafham,J. Rogers, F. Gr¨utzner, J. A. M. Graves, A. R. Green, and B. G¨ottgens.Comparative and functional analysis of LYL1 loci establish marsupialsequences as a model for phylogenetic footprinting.
Genomics , 81:249–259, 2003.[13] F. Chiaromonte, V. B. Yap, and W. Miller. Scoring pairwise genomicsequence alignments. In R. B. Altman, A. K. Dunker, L. Hunter, and15. E. Klein, editors,
Pacific Symposium on Biocomputing, pages 115–126, 2002.
[14] D. A. Dalquen, M. Anisimova, G. H. Gonnet, and C. Dessimoz. ALF - a simulation framework for genome evolution.
Molecular Biology and Evolution, 29:1115–1123, 2012.
[15] A. C. E. Darling, B. Mau, F. R. Blattner, and N. T. Perna. Mauve: multiple alignment of conserved genomic sequence with rearrangements.
Genome Research, 14:1394–1403, 2004.
[16] A. E. Darling, B. Mau, and N. T. Perna. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement.
PLOS ONE, 5:e11147+, 2010.
[17] A. E. Darling, T. J. Treangen, L. Zhang, C. Kuiken, X. Messeguer, and N. T. Perna.
Algorithms in Bioinformatics: 6th International Workshop, WABI 2006, Zurich, Switzerland, September 11-13, 2006. Proceedings, chapter Procrastination Leads to Efficient Filtration for Local Multiple Alignment, pages 126–137. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
[18] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of whole genomes.
Nucleic Acids Research, 27:2369–2376, 1999.
[19] C. N. Dewey and L. Pachter. Evolution at the nucleotide level: the problem of multiple whole-genome alignment.
Human Molecular Genetics, 15:R51–R56, 2006.
[20] A. Döring, D. Weese, T. Rausch, and K. Reinert. SeqAn – an efficient, generic C++ library for sequence analysis.
BMC Bioinformatics, 9:11, 2008.
[21] I. Dubchak, A. Poliakov, A. Kislyuk, and M. Brudno. Multiple whole-genome alignments without a reference organism.
Genome Research, 19:682–689, 2009.
[22] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison.
Biological sequence analysis. Cambridge University Press, Cambridge, UK, 1998.
[23] D. Earl, N. Nguyen, G. Hickey, R. S. Harris, S. Fitzgerald, K. Beal, I. Seledtsov, V. Molodtsov, B. J. Raney, H. Clawson, J. Kim, C. Kemena, J.-M. Chang, I. Erb, A. Poliakov, M. Hou, J. Herrero, W. J. Kent, V. Solovyev, A. E. Darling, J. Ma, C. Notredame, M. Brudno, I. Dubchak, D. Haussler, and B. Paten. Alignathon: a competitive assessment of whole-genome alignment methods.
Genome Research, 24:2077–2089, 2014.
[24] O. Gotoh. An improved algorithm for matching biological sequences.
J. Mol. Biol., 162:705–708, 1982.
[25] L. Hahn, C.-A. Leimeister, R. Ounit, S. Lonardi, and B. Morgenstern. rasbhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison.
PLOS Computational Biology, 12(10):e1005107, 2016.
[26] B. Haubold, F. Klötzl, and P. Pfaffelhuber. andi: Fast and accurate estimation of evolutionary distances between closely related genomes.
Bioinformatics, 31:1169–1175, 2015.
[27] H. Hauswedell, J. Singer, and K. Reinert. Lambda: the local aligner for massive biological data.
Bioinformatics, 30:i349–i355, 2014.
[28] M. Höhl, S. Kurtz, and E. Ohlebusch. Efficient multiple genome alignment.
Bioinformatics, 18:312S–320S, 2002.
[29] S. Horwege, S. Lindner, M. Boden, K. Hatje, M. Kollmar, C.-A. Leimeister, and B. Morgenstern.
Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.
Nucleic Acids Research, 42:W7–W11, 2014.
[30] W. Huang, D. M. Umbach, and L. Li. Accurate anchoring alignment of divergent sequences.
Bioinformatics, 22:29–34, 2006.
[31] S. Kurtz. Reducing the space requirement of suffix trees.
Software – Practice and Experience, 29:1149–1171, 1999.
[32] S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L. Salzberg. Versatile and open software for comparing large genomes.
Genome Biology, 5:R12+, 2004.
[33] C.-A. Leimeister, M. Boden, S. Horwege, S. Lindner, and B. Morgenstern. Fast alignment-free sequence comparison using spaced-word frequencies.
Bioinformatics, 30:1991–1999, 2014.
[34] C.-A. Leimeister, S. Sohrabi-Jahromi, and B. Morgenstern. Fast and accurate phylogeny reconstruction using filtered spaced-word matches.
Bioinformatics, 10.1093/bioinformatics/btw776 (in press).
[35] M. Li, B. Ma, D. Kisman, and J. Tromp. PatternHunter II: Highly sensitive and fast homology search.
Genome Informatics, 14:164–175, 2003.
[36] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search.
Bioinformatics, 18:440–445, 2002.
[37] B. Morgenstern. A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences.
Applied Mathematics Letters, 15:11–16, 2002.
[38] B. Morgenstern, A. Dress, and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison.
Proceedings of the National Academy of Sciences, 93:12098–12103, 1996.
[39] B. Morgenstern, S. J. Prohaska, D. Pöhler, and P. F. Stadler. Multiple sequence alignment with user-defined anchor points.
Algorithms for Molecular Biology, 1:6, 2006.
[40] B. Morgenstern, O. Rinner, S. Abdeddaïm, D. Haase, K. Mayer, A. Dress, and H.-W. Mewes. Exon discovery by genomic sequence alignment.
Bioinformatics, 18:777–787, 2002.
[41] B. Morgenstern, B. Zhu, S. Horwege, and C.-A. Leimeister. Estimating evolutionary distances between genomic sequences from spaced-word matches.
Algorithms for Molecular Biology, 10:5, 2015.
[42] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins.
J. Mol. Biol., 48:443–453, 1970.
[43] A. Y. Ogurtsov, M. A. Roytberg, S. A. Shabalina, and A. S. Kondrashov. OWEN: aligning long collinear regions of genomes.
Bioinformatics, 18:1703–1704, 2002.
[44] B. Paten, D. Earl, N. Nguyen, M. Diekhans, D. Zerbino, and D. Haussler. Cactus: Algorithms for genome multiple sequence alignment.
Genome Research, 21:1512–1528, 2011.
[45] B. Raphael, D. Zhi, H. Tang, and P. Pevzner. A novel method for multiple alignment of sequences with repeated and shuffled elements.
Genome Research, 14:2336–2346, 2004.
[46] T. Rausch, A.-K. Emde, D. Weese, A. Döring, C. Notredame, and K. Reinert. Segment-based multiple sequence alignment.
Bioinformatics, 24:i187–i192, 2008.
[47] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences.
Journal of Molecular Biology, 147:195–197, 1981.
[48] The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant
Arabidopsis thaliana. Nature, 408:796–815, 2000.
[49] H. Yi and L. Jin. Co-phylog: an assembly-free phylogenomic approach for closely related organisms.