Anchor points for genome alignment based on Filtered Spaced Word Matches

Abstract

Alignment of large genomic sequences is a fundamental task in computational genome analysis. Most methods for genomic alignment use high-scoring local alignments as anchor points to reduce the search space of the alignment procedure. Speed and quality of these methods therefore depend on the underlying anchor points. Herein, we propose to use Filtered Spaced Word Matches to calculate anchor points for genome alignment. To evaluate this approach, we used these anchor points in the widely used alignment pipeline Mugsy. For distantly related sequence sets, we could substantially improve the quality of alignments produced by Mugsy.



Chris-André Leimeister, Thomas Dencker, and Burkhard Morgenstern
University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
University of Göttingen, Center for Computational Sciences, Goldschmidtstr. 7, 37077 Göttingen, Germany
November 16, 2018

Introduction

Sequence comparison is one of the most fundamental tasks in computational biology. Here, a basic task is to align two or several DNA or protein sequences, either globally, over their entire length, or locally, by restricting the alignment to a single region of homology. Standard approaches to sequence alignment assume that the input sequences are derived from a common ancestral sequence, and that evolutionary events are limited to substitutions, insertions and deletions of single residues or small sequence segments. In this case, sequence homologies can be represented by global sequence alignments, that is by inserting gap characters into the sequences such that evolutionarily related sequence positions are arranged on top of each other. Under most scoring schemes, calculating an optimal alignment of two sequences takes time proportional to the product of their lengths and is therefore limited to rather short sequences [42, 47, 24, 37, 22].

With the rapidly increasing number of partially or fully sequenced genomes, alignment of genomic sequences has become an important field of research in bioinformatics; see [23] for a recent review and evaluation of some of the most popular approaches. Here, the first challenge is the sheer size of the input sequences, which makes it impossible to use traditional algorithms with quadratic run time. The second challenge is that related genomes often share multiple regions of local sequence homology, interrupted by non-conserved parts of the sequence where no significant similarities can be detected. This means that neither global nor local alignment methods can properly represent the homologies between whole genomes. Finally, evolutionary events such as duplications and large-scale rearrangements must be taken into account. Since it is not possible, in general, to represent homologies among genomes in one single alignment, advanced genome aligners return alignments of so-called Locally Collinear Blocks, i.e. blocks of segments of the input sequences that contain the same genes in the same relative order.

Since the late 1990s, major efforts have been made to address the problem of genome alignment, and many approaches have been published. One of the first multiple-alignment programs that was applied to genomic sequences was DIALIGN [38, 40]. This program composes multiple alignments from chains of local pairwise alignments, and it does not penalize gaps; it is therefore able to align sequences where local homologies are separated by long non-homologous segments. The program has been applied, for example, to identify small non-coding functional elements in genomic sequences [3, 12]. However, the program was initially not designed for large genomic sequences, and it is limited to sequences up to around 10 kb. Moreover, DIALIGN is not able to deal with duplications, rearrangements and homologies on inverse strands of genomes.

To align longer sequences, most programs for genomic alignment rely on some sort of anchoring [30, 39]. In a first step, they use a fast method for local alignment to identify high-scoring local homologies, so-called anchor points. Next, chains of such local alignments are calculated and, finally, sequence segments between the chained high-scoring local alignments are aligned with a slower but more sensitive alignment method. For multiple sequence sets, anchor points can be defined either between pairs of sequences or between several or all of the input sequences. A pioneering tool to find anchor points for genomic alignment was MUMmer [18]; the current version of the program [32] is considered the state of the art in alignment anchoring. MUMmer uses maximal unique matches as pairwise anchor points to align genomic sequences or protein sequences. By contrast, MGA [28] is a tool for multiple alignment of genomic sequences that uses maximal exact matches between all sequences within a given sequence set. Both MUMmer and MGA use suffix trees [31] to rapidly identify pairs or blocks of identical words, one word from each of the sequences, that are then used as anchor points. Both programs are able to align entire bacterial genomes; MUMmer was also used in the A. thaliana genome project [48]. However, since the probability of homologous exact matches rapidly decreases with increasing divergence, they are most useful to compare closely related genomes, such as different strains of E. coli.

Other approaches to genome alignment are OWEN [43], AVID [7], MAVID [8], LAGAN and Multi-LAGAN [10], CHAOS/DIALIGN [9], the VISTA genome pipeline [21], TBA [5] and Mauve [15]; see [19, 4] for a review. All of these methods are based on alignment anchoring, and most of them are able to deal with duplications and genome rearrangements. Some methods for genomic alignments are based on statistical properties of the sequences [6, 15]. Other methods are based on graphs, for example on A-Bruijn graphs [45] or on cactus graphs [44]. A further development of Mauve called progressiveMauve uses palindromic spaced seeds instead of exact word matches as anchor points [16]. That is, for a given binary pattern of length $\ell$ representing match and don't-care positions, one searches for a set of $\ell$-mers, one $\ell$-mer from each of the input sequences, such that all $\ell$-mers have matching nucleotides at the match positions. At the don't-care positions, mismatches are allowed. Palindromic patterns are used to cover both strands of the input sequences. Spaced seeds are used in database searching [36, 17] and alignment-free sequence comparison [33] since they have been shown to lead to better results than contiguous word matches.
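The role of palindromic patterns can be illustrated with a small sketch. It is not taken from progressiveMauve; all function names are hypothetical, and the short sequence and the two patterns are chosen only for illustration. The sketch checks that, for a palindromic pattern, every spaced word on the forward strand reappears in reverse-complemented form at the mirrored window of the reverse strand, which is not guaranteed for a non-palindromic pattern.

```python
# Hypothetical helper names; a sketch, not code from progressiveMauve.
_COMPLEMENT = str.maketrans("ACGT*", "TGCA*")


def reverse_complement(word):
    """Reverse complement of a DNA word; the wildcard '*' maps to itself."""
    return word.translate(_COMPLEMENT)[::-1]


def spaced_word(seq, start, pattern):
    """Spaced word at 0-based `start`: symbols at '1', '*' at '0' positions."""
    window = seq[start:start + len(pattern)]
    return "".join(c if p == "1" else "*" for c, p in zip(window, pattern))


def is_palindromic(pattern):
    """A binary pattern is palindromic if it reads the same in both directions."""
    return pattern == pattern[::-1]


def strand_symmetric(seq, pattern):
    """True if every forward-strand spaced word reappears, reverse-complemented,
    at the mirrored window of the reverse strand."""
    rc = reverse_complement(seq)
    last = len(seq) - len(pattern)
    return all(
        spaced_word(rc, last - i, pattern)
        == reverse_complement(spaced_word(seq, i, pattern))
        for i in range(last + 1)
    )


seq = "GATTACAGGC"                         # toy sequence, assumption
print(strand_symmetric(seq, "1101011"))   # palindromic pattern     -> True
print(strand_symmetric(seq, "1100101"))   # non-palindromic pattern -> False
```

With a palindromic pattern, indexing the spaced words of one strand is therefore enough to recognise hits on the opposite strand as well.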

Mugsy [2] is a popular software pipeline for multiple whole-genome alignment. In a first step, this program uses Nucmer [32] to construct all pairwise alignments of the input sequences. Nucmer, in turn, uses MUMmer to find exact unique word matches which are used as alignment anchor points. An alignment graph is constructed from these pairwise alignments using the SeqAn software [20], and Locally Collinear Blocks are constructed. Finally, a multiple alignment is calculated using SeqAn::TCoffee [46]. Mugsy has been designed to align closely related genomes, such as different strains of a bacterium. Here, it produces alignments of high quality. On more distantly related genomes, however, the program is often outperformed by other multiple genome aligners [23].

Finding anchor points is the most important step in whole-genome sequence alignment. Here, a trade-off between speed, sensitivity and precision is necessary. A sufficient number of anchor points is required in order to reduce the search space and thereby the run time for the subsequent, more sensitive alignment routine. Wrongly chosen anchor points, on the other hand, can substantially deteriorate the quality of the final output alignment. If spurious similarities are used as anchor points, this not only results in non-homologous parts of the sequences being aligned. Wrong anchor points may also prevent the program from aligning biologically relevant, true homologies since aligning them may be incompatible with the selected anchors. Also, if the number of anchor points is too large, finding optimal chains of anchor points can become computationally expensive.

In this paper, we propose a novel algorithm to find pairwise anchor points for genomic alignments that is based on the Filtered Spaced Word Matches (FSWM) idea that we previously introduced [34]. Anchor points are calculated using a hit-and-extend approach where high-scoring spaced-word matches are used as seeds: for an underlying binary pattern of length $\ell$ representing match and don't-care positions, we rapidly identify spaced-word matches, i.e. length-$\ell$ segment pairs from the input sequences with matching nucleotides at the match positions but with possible mismatches at the don't-care positions. For each spaced-word match, we then calculate a similarity score considering all aligned positions, including the don't-care positions, and we keep only those spaced-word matches that have a score above a certain threshold. These segment pairs are then extended to locally maximal gap-free alignments, similar to BLAST [1]. To evaluate our anchoring approach, we ran the Mugsy pipeline with our software in the initial step, to find anchor points. For closely related input sequences, the quality of the resulting alignments is comparable to the original version of Mugsy where exact word matches are used for anchoring. Our approach is far superior, however, if distantly related sequences are to be aligned, where most other alignment approaches either fail to produce alignments or require an unacceptable amount of time.

Through our web site, we provide the adapted Mugsy pipeline with our anchoring approach as a pipeline for genome-sequence alignment that can be readily installed. A standalone version of our spaced-words software is provided as well, such that developers can integrate it into their own sequence-analysis pipelines.

Filtered Spaced Word Matches

For a sequence $S$ of length $L$ over an alphabet $\Sigma$ and $0 < i \le L$, $S[i]$ denotes the $i$-th symbol of $S$. For integers $w \le \ell$, a binary pattern $P$ of length $\ell$ and weight $w$ is a word over $\{0, 1\}$ of length $\ell$ such that there are exactly $w$ indices $i$ with $P[i] = 1$. These positions are called match positions, while positions $i$ with $P[i] = 0$ are called don't-care positions. A spaced word with respect to a pattern $P$ is a word $w$ over $\Sigma \cup \{*\}$ where '$*$' is a wildcard character not contained in $\Sigma$, and $w[k] = *$ holds if and only if $k$ is a don't-care position, i.e. if $P[k] = 0$, see also [33, 29]. A spaced word $w$ with respect to a pattern $P$ occurs in a sequence $S$ at position $i$ if $S[i + k - 1] = w[k]$ for all match positions $k$ of the pattern $P$.

For sequences $S_1$ and $S_2$ with lengths $L_1$ and $L_2$, respectively, a pattern $P$ of length $\ell$, and positions $1 \le i \le L_1 - \ell + 1$, $1 \le j \le L_2 - \ell + 1$, we say that there is a spaced-word match between $S_1$ and $S_2$ at $(i, j)$ with respect to $P$ if the spaced words at $i$ in $S_1$ and at $j$ in $S_2$ are identical; in other words, if for all match positions $k$ in $P$, one has

$$S_1[i + k - 1] = S_2[j + k - 1].$$

Below is a spaced-word match between two DNA sequences $S_1$ and $S_2$ at $(5, 2)$ with respect to the pattern $P = 1100101$:

  S_1:  G C T G T A T A C G T C
  S_2:        S T A C A C T T A T
  P:            1 1 0 0 1 0 1

Indeed, the spaced word 'T A * * C * T' occurs at position 5 in $S_1$ and at position 2 in $S_2$.

Herein, we propose to use spaced-word matches as a first step to calculate anchor points for pairwise alignment. We therefore need some criterion to distinguish between spaced-word matches representing true homologies and random background matches. In a previous paper, we used spaced-word matches to estimate phylogenetic distances between genomic sequences [34]. To this end, we first identified all spaced-word matches with respect to a given pattern $P$. To remove spurious random spaced-word matches, we applied a simple filtering procedure: using a nucleotide substitution matrix [13], we calculated for each spaced-word match the sum of the scores of all aligned pairs of nucleotides (at match and don't-care positions alike), and we removed all spaced-word matches with a score below zero.

A graphical representation of the spaced-word matches between two sequences shows that this procedure can clearly separate random spaced-word matches from true homologies. If we plot for each possible score value $s$ the number of spaced-word matches with score $s$, we obtain a bimodal distribution with one peak for random matches and a second peak for homologies. We call such a plot a spaced-words histogram. For simulated sequence pairs under a simple model of evolution, both peaks are normally distributed. For real-world sequences, the random peak is still normally distributed, but the 'homologous' peak is more complex, see Figure 1. Even so, using a cut-off value of zero can clearly distinguish between random matches and true homologies. More examples for spaced-words histograms are given in [34].

Figure 1: Spaced-words histogram for a comparison of two bacterial genomes, Phaeobacter gallaeciensis and Rhodobacterales bacterium Y4I. All possible spaced-word matches with respect to a given binary pattern $P$ are identified, and their scores are calculated as explained in the main text. The number of spaced-word matches with a score $s$ is plotted against $s$. Two peaks are visible, an approximately normally distributed peak for background spaced-word matches, and a more complex peak for spaced-word matches representing homologies. With a cut-off value of zero, background and homologous spaced-word matches can be reliably separated.

Our approach to find anchor points for pairwise genomic alignment is as follows. For given parameters $\ell$ and $w$, we first calculate a binary pattern $P$ with length $\ell$ and weight (number of match positions) $w$ using our recently developed software rasbhari [25]. We then identify all spaced-word matches with respect to $P$. To find homologies even for distantly related sequences, we use patterns with a low weight; by default, we use a weight of $w = 10$. On the other hand, we use a large number of don't-care positions, since this makes it easier to distinguish true homologies from random spaced-word matches.
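The matching and filtering steps described so far can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the FSWM implementation: the function names are hypothetical, hashing the symbols at the match positions is just one simple way to find spaced-word matches, and the substitution scores are entered as the values commonly distributed for the matrix of [13] (they should be checked against that reference). The two sequences are copied symbol by symbol from the example above; the first symbol of the second sequence lies outside the match and is never scored.

```python
from collections import defaultdict

# Substitution scores assumed to be those of the matrix in [13].
_NT = "ACGT"
_ROWS = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
SCORE = {(a, b): _ROWS[x][y] for x, a in enumerate(_NT) for y, b in enumerate(_NT)}


def spaced_word_matches(s1, s2, pattern):
    """Return all spaced-word matches (i, j), 1-based as in the text."""
    n = len(pattern)
    match_pos = [k for k, p in enumerate(pattern) if p == "1"]
    index = defaultdict(list)  # symbols at match positions -> positions in s2
    for j in range(len(s2) - n + 1):
        index[tuple(s2[j + k] for k in match_pos)].append(j)
    hits = []
    for i in range(len(s1) - n + 1):
        key = tuple(s1[i + k] for k in match_pos)
        hits.extend((i + 1, j + 1) for j in index.get(key, ()))
    return hits


def match_score(s1, s2, i, j, length):
    """Sum of substitution scores over all aligned pairs, match and don't-care."""
    return sum(SCORE[(s1[i - 1 + k], s2[j - 1 + k])] for k in range(length))


def filtered_matches(s1, s2, pattern, threshold=0):
    """Keep spaced-word matches whose score is not below the threshold."""
    scored = (
        (i, j, match_score(s1, s2, i, j, len(pattern)))
        for i, j in spaced_word_matches(s1, s2, pattern)
    )
    return [hit for hit in scored if hit[2] >= threshold]


S1 = "GCTGTATACGTC"
S2 = "STACACTTAT"   # as printed in the example; first symbol is not scored
print(spaced_word_matches(S1, S2, "1100101"))  # [(5, 2)]
print(filtered_matches(S1, S2, "1100101"))     # [(5, 2, 319)]
```

For genome-sized inputs, sorting or a more compact index would replace the dictionary, but the logic of matching on the match positions and scoring on all positions stays the same.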
By default, we use a pattern length of $\ell = 110$, so our patterns contain 10 match positions and 100 don't-care positions; we use the following nucleotide substitution matrix described in [13]:

          A      C      G      T
  A      91   -114    -31   -123
  C            100   -125    -31
  G                   100   -114
  T                           91

We calculate the score of each spaced-word match as the sum of the substitution scores of all aligned pairs of nucleotides. We then discard all spaced-word matches with a score below zero.

Next, we extend the identified spaced-word matches in both directions without gaps. As the starting point for this extension, we do not use the full spaced-word matches, but their mid points. The reason for this is that, with our long patterns, even a high-scoring spaced-word match may not represent sequence homologies over its entire length. It often occurs that some part of a spaced-word match aligns homologous nucleotides, but another part extends into non-homologous regions of the sequences. There is a high probability, however, that the mid point of a long, high-scoring spaced-word match is located within a region of true homology. Finally, we use the resulting 'extended' gap-free alignments as anchor points for alignment.

Evaluation

To evaluate Filtered Spaced Word Matches (FSWM) and to compare it to the state-of-the-art approach to alignment anchoring, we used the Mugsy software system. As mentioned above, the original Mugsy uses MUMmer to find pairwise anchor points. We replaced MUMmer in the Mugsy pipeline by our FSWM-based anchor points and evaluated the resulting multiple alignments. In addition, we compared these alignments to alignments produced by the multiple genome aligner Cactus [44]. Cactus is known to be one of the best existing tools for multiple genome alignment; it performed excellently in the Alignathon study [23]. To measure the performance of the compared methods, we used simulated genomic sequences as well as three sets of real genomes. To make MUMmer directly comparable to FSWM, we used a minimum length of 10 nt for maximal unique matches, corresponding to the default weight (number of match positions) used in Spaced Words. Note that, by default, MUMmer uses a minimum length of 15 nt. With this default value, however, we obtained alignments of much lower quality. The Cactus tool was run with default values.

Simulated genome sequences

To simulate genomic sequences, we used the Artificial Life Framework (ALF) developed by Dalquen et al. [14]. ALF evolves gene sequences based on a probabilistic model along a randomly generated tree, starting with an ancestral gene. During this process, evolutionary events are logged such that the true MSA is known for each simulated gene family. This true MSA can then be used as a reference to assess the quality of automatically generated alignments.

We generated a series of 14 data sets, each containing 30 simulated 'genomes', with increasing mutation rates for the different data sets. For all other parameters in ALF, we used the default settings. In each data set, there are 750 simulated gene families such that one gene from each gene family is present in each of the 30 simulated genomes. Thus, each of the 'genomes' contains the same set of 750 genes. We varied the mutation rates between an average of 0.1013 substitutions per position for the first data set and an average of 0.8349 substitutions per position for the 14th data set. The maximal pairwise distances between all pairs of sequences within one data set range from 0.1640 for the first to 1.0923 for the 14th data set. The simulated genes have an average length of about 1500 bp, summing up to a total size of about 32 MB per data set.

To assess the quality of the produced alignments, we calculated recall and precision values in the usual way. If, for one given data set, $S$ is the set of all positions in the 30 simulated genomes, we denote by $A \subset S^2$ the set of all pairs of positions aligned by the alignment that is to be evaluated, while $R \subset S^2$ denotes the set of all pairs of positions aligned in the reference alignment. Recall and precision are then defined as

$$\text{recall} = \frac{|A \cap R|}{|R|}, \qquad \text{precision} = \frac{|A \cap R|}{|A|} \tag{1}$$

The harmonic mean of recall and precision is called the balanced F-score and is often used as an overall measure of accuracy; it is thus defined as

$$F\text{-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

To estimate these three values, we used the tool mafComparator which was also used in the Alignathon study [23]. Since it is impractical to consider the entire set $S^2$ of pairs of positions of the test sequences, we sampled 10 million pairs of positions for each data set. This corresponds to the evaluation procedure used in Alignathon.

For the simulated sequence sets, the precision and recall values are shown in Figure 2. For data sets with smaller mutation rates, alignments obtained with FSWM are only slightly better than those obtained with MUMmer. However, if the mutation rate increases, our spaced-words approach substantially outperforms the original version of Mugsy where exact word matches are used to find anchor points. Not only are more homologies detected, but the precision is also slightly higher if Filtered Spaced Word Matches is used instead of MUMmer.

Real-world genome sequences

For real-world genome families, it is usually not possible to calculate the precision of MSA programs because it is, in general, not known which sequence positions exactly are homologous to each other and which ones are not. If there are core blocks of the sequences for which the biologically correct alignment is known, at least the recall can be calculated for these core blocks. For most genome sequences, however, no such core blocks are available. To evaluate Mugsy, the authors of the program used the number of core columns of the produced alignments as a criterion for alignment quality [2]. Here, a core column is defined as a column that does not contain gaps, i.e. a column that aligns nucleotides from all of the input sequences. In addition, the authors of Mugsy used the number of pairs of aligned positions of the aligned sequences as an indicator of alignment quality. In this paper, we use the same criteria to evaluate multiple alignments of real-world genomes.

As a first real-world example, we used a set of 29 E. coli/Shigella genomes that has already been used in the original Mugsy paper, see supplementary material for details; these sequences have also been used to evaluate alignment-free methods [26, 49, 41]. The total size of this data set is about 141 MB. As a second test set, we used another prokaryotic data set which consists of 32 complete Roseobacter genomes (details in the supplementary material). This data set was used to assess the performance on more distantly related organisms than the E. coli/Shigella strains. The total size of this data set is about 135 MB. To test our approach on eukaryotic genomes, we used as a third test case a set of nine fungal genomes, namely Coprinopsis cinerea, Neurospora crassa, Aspergillus terreus, Aspergillus nidulans, Histoplasma capsulatum, Paracoccidioides brasiliensis, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Ustilago maydis (GenBank accession numbers are given in the supplementary material). The total size of this third data set is about 253 MB.

Figure 2: Recall and Precision of Mugsy with anchor points from Filtered Spaced Word Matches (FSWM) and MUMmer, respectively, and of Cactus on simulated genomic sequences generated with ALF, see main text for details. FSWM was used with the default weight w = 10, i.e. with 10 match positions in the underlying pattern. In addition, we ran FSWM with w = 8. (Two panels, recall and precision, each plotted against the average pairwise distance; curves: MUMs+mugsy, spacedAnchors (k=10)+mugsy, spacedAnchors (k=8)+mugsy, cactus.)

Figure 3: F-Score of Mugsy with anchor points from Filtered Spaced Word Matches and MUMmer, respectively, and of Cactus on simulated genomic sequences generated with ALF. (F-score plotted against the average pairwise distance; curves: MUMs+mugsy, spacedAnchors (k=10)+mugsy, spacedAnchors (k=8)+mugsy, cactus.)

The results of Mugsy with MUMmer and FSWM for the three real-world data sets are shown in Table 1, together with the results obtained with Cactus. In addition to the number of core columns and the number of aligned pairs of positions, the table contains the number of core Locally Collinear Blocks, i.e. the number of Locally Collinear Blocks involving all of the input sequences, and the total number of Locally Collinear Blocks returned by the alignment programs.
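The quality measures used in this evaluation (aligned pairs of positions, core columns, and recall, precision and F-score against a reference) can be computed directly on a small multiple alignment. The sketch below is only an illustration: function names are hypothetical, the alignments are toy examples, and all aligned pairs are enumerated exhaustively, whereas mafComparator works on a sample of pairs.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Set of aligned position pairs ((seq_a, pos_a), (seq_b, pos_b)) of an MSA
    given as equal-length gapped strings; positions are 0-based per sequence."""
    counters = [0] * len(msa)
    pairs = set()
    for column in zip(*msa):
        residues = []
        for s, symbol in enumerate(column):
            if symbol != "-":
                residues.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def core_columns(msa):
    """Number of columns without gaps, i.e. columns covering all sequences."""
    return sum(1 for column in zip(*msa) if "-" not in column)


def recall_precision_f(test_pairs, ref_pairs):
    """Recall, precision and balanced F-score of test pairs A against
    reference pairs R, following equation (1)."""
    common = len(test_pairs & ref_pairs)
    recall = common / len(ref_pairs) if ref_pairs else 0.0
    precision = common / len(test_pairs) if test_pairs else 0.0
    total = recall + precision
    f_score = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_score


reference = ["ACG-T", "AC-GT", "ACGGT"]   # toy reference alignment
test = ["ACG-T", "ACG-T", "ACGGT"]        # toy alignment to be evaluated

R, A = aligned_pairs(reference), aligned_pairs(test)
print(len(R), len(A), core_columns(test))   # 11 12 4
print(recall_precision_f(A, R))             # recall 10/11, precision 10/12
```

The toy example also shows why core columns alone cannot replace precision: the test alignment has more core columns than the reference, yet one of them aligns a position that the reference places elsewhere.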

Program run time

Table 2 reports the program run times of Mugsy with FSWM, Mugsy with MUMmer and Cactus on the above three real-world sequence sets. In addition, the table contains the run times for FSWM and MUMmer alone.

                      core LCBs   aligned pairs   core columns      LCBs

  E. coli/Shigella genomes
  Mugsy + MUMmer            539        1.61E+09      2,827,115     4,138
  Mugsy + FSWM              664        1.63E+09      2,867,432     5,906
  Cactus

  Roseobacter genomes
  Mugsy + MUMmer             39        3.63E+08         13,654    13,501
  Mugsy + FSWM              859        7.15E+08        824,054    30,836
  Cactus

  Fungal genomes
  Mugsy + MUMmer
  Mugsy + FSWM
  Cactus

Table 1: Alignments of 29 E. coli/Shigella genomes, 32 Roseobacter genomes and 9 fungal genomes, calculated with Mugsy using anchor points from our spaced-words approach and from MUMs, respectively, and with Cactus. The first column contains the number of core Locally Collinear Blocks (LCBs), i.e. the number of LCBs that involve all of the aligned genomes ('core LCBs'); the second column contains the total number of aligned pairs of positions in the alignment. The third column contains the number of core columns, i.e. the number of columns in the multiple alignment that do not contain gaps, while the last column contains the total number of LCBs.

                      E. coli/Shigella   Roseobacter   fungal genomes
  FSWM                              59            83              110
  FSWM + Mugsy                     638          6428             1488
  MUMmer                            73            63               43
  MUMmer + Mugsy                   286          1099               63
  Cactus                           714          1775              775

Table 2: Run time in minutes for three different multiple genome-alignment methods applied to the three test data sets that we used in our program evaluation.

In this paper, we proposed a novel approach to calculate anchor points for genome alignment. Finding suitable anchor points is a critical step in all methods for genome alignment, since the selected anchor points determine which regions of the sequences can be aligned to each other in the final alignment. A sufficient number of anchor points is necessary to keep the search space and run time of the main alignment procedure manageable, so sensitive methods are needed to find anchor points. Wrongly selected anchor points, on the other hand, can seriously deteriorate the quality of the final alignments, so anchoring procedures must also be highly specific.

Earlier approaches to genomic alignment used exact word matches as anchor points [18, 28], since such matches can be easily found using suffix trees and related indexing structures. These approaches are limited, however, to situations where closely related genomes are to be aligned, for example different strains of a bacterium. In modern approaches to database searching, spaced seeds are used to find potential sequence homologies [35, 27, 11]. Here, binary patterns of match and don't-care positions are used, and two sequence segments of the corresponding length are considered to match if identical residues are aligned at the match positions, while mismatches are allowed at the don't-care positions. Such pattern-based approaches are more sensitive than previous methods that relied on exact word matches.

We previously proposed to apply the 'spaced-seeds' idea to alignment-free sequence comparison, by replacing contiguous words by so-called spaced words, i.e. by words that contain wildcard characters at certain pre-defined positions [33]. More recently, we introduced filtered spaced word matches [34] to estimate phylogenetic distances between genome sequences. In the latter approach, we first identify spaced-word matches using relatively long patterns with only few match positions. For the identified matching segments, we then look at all aligned pairs of nucleotides, including the ones at the don't-care positions, and we discard spaced-word matches if the overall degree of similarity between the two segments is below a threshold. Phylogenetic distances can be estimated based on the aligned nucleotides at the don't-care positions of the remaining spaced-word matches. We showed that this procedure is fast and highly sensitive, and it can reliably distinguish between true homologies and spurious sequence similarities.

In the present study, we used filtered spaced word matches to calculate high-quality anchor points for genomic sequence alignment.
Instead of using spaced-word matches directly as anchor points, we extend them in both directions, similar to the hit-and-extend approach to database searching. To evaluate these anchor points, we integrated them into the popular genome-alignment pipeline Mugsy. Test runs on simulated genome sequences show that, for closely related sequences, Mugsy produces alignments of high quality with both types of anchor points. For more distantly related sequences, however, the recall values of the program drop dramatically if anchor points are calculated with MUMmer while, with our spaced-word matches, one observes recall values close to 100% for distances up to around 0.7 substitutions per position.

For real-world genomes, it is more difficult to evaluate the performance of genome aligners since there is only limited information available on which positions are homologous to each other and which ones are not. Angiuoli and Salzberg [2] therefore used the number of aligned pairs of positions as an indicator of alignment quality, together with the size of the 'core alignment', i.e. the number of alignment columns that do not contain gaps. At first glance, these criteria might seem questionable; it would be trivial to maximize these values, simply by aligning sequences without internal gaps, adding gaps only at the ends of the shorter sequences. However, as shown in Figure 2, all MSA programs in our study have high precision values, i.e. positions aligned by these programs are likely to be true homologs. In this situation, the number of aligned position pairs and the size of the 'core alignment' can be considered as a proxy for the recall of the applied methods, i.e. the proportion of homologies that are correctly aligned.

For distantly related sequence sets, the total run time of Mugsy is much higher with our FSWM anchoring approach than with MUMmer. One reason for the increased run time with FSWM is the fact that, with spaced words, far more Locally Collinear Blocks are detected than if exact word matches are used as anchor points, especially for distantly related sequences where exact word matching is not very sensitive. One possible solution for this issue would be to apply user-defined threshold values for the total number of returned Locally Collinear Blocks or for their similarity scores, to reduce the run time of the final alignment procedure for large genomic sequences.

References [1] S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Ba-sic local alignment search tool.

Journal of Molecular Biology , 215:403–410, 1990.[2] S. V. Angiuoli and S. L. Salzberg. Mugsy: fast multiple alignment ofclosely related whole genomes.

Bioinformatics , 27:334–342, 2011.143] L. M. Barton, B. G¨ottgens, M. Gering, J. G. Gilbert, D. Grafham,J. Rogers, D. Bentley, R. Patient, and A. R. Green. Regulation of thestem cell leukemia (SCL) gene: a tale of two ﬁshes.

Proc. Natl. Acad.Sci. USA , 98:6747–6752, 2001.[4] S. Batzoglou. The many faces of sequence alignment.

Brieﬁngs inBioinformatics , 6:6–22, 2005.[5] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. A. Smit,K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green,D. Haussler, and W. Miller. Aligning multiple genomic sequences withthe threaded blockset aligner.

Genome Research , 14:708–715, 2004.[6] R. K. Bradley, A. Roberts, M. Smoot, S. Juvekar, J. Do, C. Dewey,I. Holmes, and L. Pachter. Fast statistical alignment.

PLOS ComputBiol , 5:e1000392, 2009.[7] N. Bray, I. Dubchak, and L. Pachter. AVID: A global alignment pro-gram.

Genome Research , 13:97–102, 2003.[8] N. Bray and L. Pachter. MAVID multiple alignment server.

NucleicAcids Research , 31:3525–3526, 2003.[9] M. Brudno, M. Chapman, B. G¨ottgens, S. Batzoglou, and B. Morgen-stern. Fast and sensitive multiple alignment of large genomic sequences.

BMC Bioinformatics , 4:66, 2003.[10] M. Brudno, C. Do, G. Cooper, M. Kim, E. Davydov, NISC SequencingConsortium, E. Green, A. Sidow, and S. Batzoglou. LAGAN and multi-LAGAN: Eﬃcient tools for large-scale multiple alignment of genomicDNA.

Genome Research, 13:721–731, 2003.

[11] B. Buchfink, C. Xie, and D. H. Huson. Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12:59–60, 2015.

[12] M. A. Chapman, F. J. Charchar, S. Kinston, C. P. Bird, D. Grafham, J. Rogers, F. Grützner, J. A. M. Graves, A. R. Green, and B. Göttgens. Comparative and functional analysis of LYL1 loci establish marsupial sequences as a model for phylogenetic footprinting. Genomics, 81:249–259, 2003.

[13] F. Chiaromonte, V. B. Yap, and W. Miller. Scoring pairwise genomic sequence alignments. In R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein, editors, Pacific Symposium on Biocomputing, pages 115–126, 2002.

[14] D. A. Dalquen, M. Anisimova, G. H. Gonnet, and C. Dessimoz. ALF – a simulation framework for genome evolution. Molecular Biology and Evolution, 29:1115–1123, 2012.

[15] A. C. E. Darling, B. Mau, F. R. Blattner, and N. T. Perna. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research, 14:1394–1403, 2004.

[16] A. E. Darling, B. Mau, and N. T. Perna. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement.
PLOS ONE, 5:e11147+, 2010.

[17] A. E. Darling, T. J. Treangen, L. Zhang, C. Kuiken, X. Messeguer, and N. T. Perna. Algorithms in Bioinformatics: 6th International Workshop, WABI 2006, Zurich, Switzerland, September 11-13, 2006. Proceedings, chapter Procrastination Leads to Efficient Filtration for Local Multiple Alignment, pages 126–137. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.

[18] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of whole genomes. Nucleic Acids Research, 27:2369–2376, 1999.

[19] C. N. Dewey and L. Pachter. Evolution at the nucleotide level: the problem of multiple whole-genome alignment. Human Molecular Genetics, 15:R51–R56, 2006.

[20] A. Döring, D. Weese, T. Rausch, and K. Reinert. SeqAn – an efficient, generic C++ library for sequence analysis. BMC Bioinformatics, 9:11, 2008.

[21] I. Dubchak, A. Poliakov, A. Kislyuk, and M. Brudno. Multiple whole-genome alignments without a reference organism. Genome Research, 19:682–689, 2009.

[22] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis. Cambridge University Press, Cambridge, UK, 1998.

[23] D. Earl, N. Nguyen, G. Hickey, R. S. Harris, S. Fitzgerald, K. Beal, I. Seledtsov, V. Molodtsov, B. J. Raney, H. Clawson, J. Kim, C. Kemena, J.-M. Chang, I. Erb, A. Poliakov, M. Hou, J. Herrero, W. J. Kent, V. Solovyev, A. E. Darling, J. Ma, C. Notredame, M. Brudno, I. Dubchak, D. Haussler, and B. Paten. Alignathon: a competitive assessment of whole-genome alignment methods.
Genome Research, 24:2077–2089, 2014.

[24] O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705–708, 1982.

[25] L. Hahn, C.-A. Leimeister, R. Ounit, S. Lonardi, and B. Morgenstern. rasbhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison. PLOS Computational Biology, 12(10):e1005107, 2016.

[26] B. Haubold, F. Klötzl, and P. Pfaffelhuber. andi: Fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics, 31:1169–1175, 2015.

[27] H. Hauswedell, J. Singer, and K. Reinert. Lambda: the local aligner for massive biological data. Bioinformatics, 30:i349–i355, 2014.

[28] M. Höhl, S. Kurtz, and E. Ohlebusch. Efficient multiple genome alignment. Bioinformatics, 18:312S–320S, 2002.

[29] S. Horwege, S. Lindner, M. Boden, K. Hatje, M. Kollmar, C.-A. Leimeister, and B. Morgenstern. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.
Nucleic Acids Research, 42:W7–W11, 2014.

[30] W. Huang, D. M. Umbach, and L. Li. Accurate anchoring alignment of divergent sequences. Bioinformatics, 22:29–34, 2006.

[31] S. Kurtz. Reducing the space requirement of suffix trees. Software – Practice and Experience, 29:1149–1171, 1999.

[32] S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L. Salzberg. Versatile and open software for comparing large genomes. Genome Biology, 5:R12+, 2004.

[33] C.-A. Leimeister, M. Boden, S. Horwege, S. Lindner, and B. Morgenstern. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics, 30:1991–1999, 2014.

[34] C.-A. Leimeister, S. Sohrabi-Jahromi, and B. Morgenstern. Fast and accurate phylogeny reconstruction using filtered spaced-word matches.
Bioinformatics, 10.1093/bioinformatics/btw776 (in press).

[35] M. Li, B. Ma, D. Kisman, and J. Tromp. PatternHunter II: Highly sensitive and fast homology search. Genome Informatics, 14:164–175, 2003.

[36] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18:440–445, 2002.

[37] B. Morgenstern. A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences. Applied Mathematics Letters, 15:11–16, 2002.

[38] B. Morgenstern, A. Dress, and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proceedings of the National Academy of Sciences, 93:12098–12103, 1996.

[39] B. Morgenstern, S. J. Prohaska, D. Pöhler, and P. F. Stadler. Multiple sequence alignment with user-defined anchor points. Algorithms for Molecular Biology, 1:6, 2006.

[40] B. Morgenstern, O. Rinner, S. Abdeddaïm, D. Haase, K. Mayer, A. Dress, and H.-W. Mewes. Exon discovery by genomic sequence alignment. Bioinformatics, 18:777–787, 2002.

[41] B. Morgenstern, B. Zhu, S. Horwege, and C.-A. Leimeister. Estimating evolutionary distances between genomic sequences from spaced-word matches.
Algorithms for Molecular Biology, 10:5, 2015.

[42] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970.

[43] A. Y. Ogurtsov, M. A. Roytberg, S. A. Shabalina, and A. S. Kondrashov. OWEN: aligning long collinear regions of genomes. Bioinformatics, 18:1703–1704, 2002.

[44] B. Paten, D. Earl, N. Nguyen, M. Diekhans, D. Zerbino, and D. Haussler. Cactus: Algorithms for genome multiple sequence alignment. Genome Research, 21:1512–1528, 2011.

[45] B. Raphael, D. Zhi, H. Tang, and P. Pevzner. A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Research, 14:2336–2346, 2004.

[46] T. Rausch, A.-K. Emde, D. Weese, A. Döring, C. Notredame, and K. Reinert. Segment-based multiple sequence alignment. Bioinformatics, 24:i187–i192, 2008.

[47] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[48] The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408:796–815, 2000.

[49] H. Yi and L. Jin. Co-phylog: an assembly-free phylogenomic approach for closely related organisms.