[PDF] Aligning 415 519 proteins in less than two hours on PC

Abstract

Rapid development of modern sequencing platforms enabled an unprecedented growth of protein families databases. The abundance of sets composed of hundreds of thousands sequences is a great challenge for multiple sequence alignment algorithms. In the article we introduce FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilisation of longest common subsequence measure for determining pairwise similarities, a novel method of gap costs evaluation, and a new iterative refinement scheme. Importantly, its implementation is highly optimised and parallelised to make the most of modern computer platforms. Thanks to the above, quality indicators, namely sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms like Clustal Omega or MAFFT for datasets exceeding a few thousand of sequences. The quality does not compromise time and memory requirements which are an order of magnitude lower than that of existing solutions. For example, a family of 415 519 sequences was analysed in less than two hours and required only 8GB of RAM. FAMSA is freely available at this http URL.

Full PDF

AAligning 415 519 proteins in less than two hourson PC

Sebastian Deorowicz , Agnieszka Debudaj-Grabysz , and Adam Gudy ´s Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland * [email protected] ABSTRACT

Rapid development of modern sequencing platforms enabled an unprecedented growth of protein families databases. Theabundance of sets composed of hundreds of thousands sequences is a great challenge for multiple sequence alignmentalgorithms. In the article we introduce FAMSA, a new progressive algorithm designed for fast and accurate alignment ofthousands of protein sequences. Its features include the utilisation of longest common subsequence measure for determiningpairwise similarities, a novel method of gap costs evaluation, and a new iterative reﬁnement scheme. Importantly, itsimplementation is highly optimised and parallelised to make the most of modern computer platforms. Thanks to the above,quality indicators, namely sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms likeClustal Omega or MAFFT for datasets exceeding a few thousand of sequences. The quality does not compromise time andmemory requirements which are an order of magnitude lower than that of existing solutions. For example, a family of 415 519sequences was analysed in less than two hours and required only 8GB of RAM.FAMSA is freely available at http://sun.aei.polsl.pl/REFRESH/famsa.

Introduction

The multiple sequence alignment (MSA) is one of the most important analyses in molecular biology. Majority of algorithmsfor the MSA problem conform to a progressive heuristics. The scheme includes three stages: (I) calculation of a similaritymatrix for investigated sequences, (II) a guide tree construction, (III) greedy alignment according to the order given by the tree.Pairwise similarities can be established variously. Some algorithms use accurate, but time-consuming methods like calculatingpairwise alignments of highest probability or maximum expected accuracy. Others employ approximated, though fasterapproaches, e.g., tuple matching.

As sizes of protein families to be analysed has been constantly increased, the necessityto calculate all pairwise similarities has become a bottleneck of alignment algorithms. Therefore, many attempts have beenmade to accelerate this stage. Kalign and Kalign2 employ for similarity measurement, respectively, Wu-Manber andMuth-Manber fast string matching algorithms. This allows thousands of sequences to be aligned in a reasonable time. Theidea has been further extended by the authors of the presented research in Kalign-LCS, which introduced to Kalign2 pipelinelongest common subsequence for similarity measurement. This improved both, alignment quality as well as execution time.Nevertheless, the recent developments in high throughput sequencing confront biologists with the necessity to align proteinfamilies containing tens of thousands of members. Progressive algorithms which calculate and store all pairwise similaritydistances, were inapplicable for problems of such sizes due to excessive time and memory requirements.An introduction of PartTree, a divisive sequence clustering algorithm for building a guide tree without calculating allpairwise similarities, was one of the ideas to tackle the problem. With average time complexity of O ( k log k ) and spacecomplexity of O ( k ) ( k is the number of sequences in the input set), PartTree was successfully adopted by MAFFT 6 package allowing tens of thousands of sequences to be aligned on a typical desktop computer. A different approach was presented inClustal Omega. It uses mBed, an algorithm for embedding sequences into a lower-dimensional space, which requires only O ( k log k ) exact similarity values to approximate others. The embedding is combined with sequence clustering with the use of K -means algorithm, which prevents from storing whole similarity matrix and keeps memory requirements under control. BothMAFFT and Clustal Omega use tuple matching for similarity calculation.While MAFFT and Clustal Omega are computationally applicable for families of even 100 000 proteins, we show thatquality of results for such problems is often unsatisfactory. In this paper we present FAMSA, a progressive multiple alignmentalgorithm especially suitable for large sets of sequences. Pairwise similarities are established, similarly to Kalign-LCS, on thebasis of longest common subsequences (LCS). Unlike MAFFT and Clustal Omega, FAMSA calculates all pairwise similarities,which is efﬁcient due to utilisation of multithreaded, bit-parallel LCS algorithm suited for AVX extensions of modernprocessors. Employing memory-saving single-linkage algorithm for guide tree construction, reduces memory requirementsof the ﬁrst stage to O ( k ) . An important factor contributing to the computational scalability of FAMSA is a novel, in-place a r X i v : . [ q - b i o . GN ] M a r lgorithm of proﬁle alignment which prevents memory reallocations during the progressive stage. As a result, FAMSA isthe fastest and most memory-efﬁcient alignment software when large protein families are of interest. The predominance wasobserved on sets ranging from thousands to a half million of sequences.The efﬁciency of FAMSA comes with superior accuracy. This is thanks to a number of algorithm features. They includeusing LCS for similarity measurement, MIQS substitution matrix, and a correction of gap penalties inspired by MUSCLE. The penalties are additionally adjusted to the set size which is a novel technique in alignment software, particularly proﬁtablefor large sets of sequences. Misalignments during progressive stage are ﬁxed with a use of reﬁnement scheme similar to theone included in QuickProbs 2. Consequently, when sets of a few thousands or more sequences are of interest, FAMSA issigniﬁcantly more accurate than any other algorithm. Importantly, the difference increases with growing number of sequences.E.g., for sets exceeding 25 000 proteins, FAMSA properly aligned 35% and 25% more columns than the most accurate variantsof MAFFT and Clustal Omega. When largest benchmark family containing 415 519 sequences was investigated, the advancewas even more remarkable—FAMSA successfully restored 4 times more columns than competitors, at a fraction of requiredtime and memory.Scalability of FAMSA was assessed on extHomFam, a new benchmark generated analogously to HomFam by enrichingHomstrad with families from PFam database. It contains 380 sets of sizes ranging from 218 to 415 519 sequences. Theabundance of numerous protein families ( k >

10 000) makes extHomFam particularly representative for large-scale alignmentproblems, which are of crucial importance in the face of recent advances in high throughput sequencing.

Methods

FAMSA, similarly to other progressive algorithms, is composed of four stages:I. Calculation of pairwise similarities,II. Determination of a guide tree,III. Progressive proﬁle merging according to the guide tree order,IV. Optional iterative reﬁnement of the ﬁnal proﬁle.Detailed descriptions of the algorithm stages together with analyses of time and space complexities are given in the followingsubsections.

Pairwise similarities calculation

To determine the pairwise similarities of sequences in the input set we use the length of a longest common subsequence(LCS). The choice was motivated by the promising results of LCS application to this task in the former studies.

Given twosequences A and B , the length of an LCS is the maximal number of perfectly matching columns. This can be considered as anestimation of true pairwise alignment. To compensate the effect of LCS length being larger for longer sequences, the value isnormalised by the indel distance (the number of single-symbol insertions and deletions necessary to transform one sequence toanother). This distance approximates the misalignment cost, i.e., the number of gaps in the alignment, in which only perfectmatches are allowed. To penalise the differences between two sequences more than reward the similarities, indel distance issquared, as in the former work on Kalign-LCS: similarity ( A , B ) = LCS len ( A , B ) indel ( A , B ) . The LCS length can be computed using a straightforward dynamic programming (DP) rule. Thanks to the internalproperties of the DP matrix, the calculation can be made using the bit-parallel approach, in which w cells are computed at atime ( w is a computer word size equal to 64 in modern architectures). The indel distance for the sequences A and B can bedirectly derived from the LCS length according to the formula: indel ( A , B ) = | A | + | B | − × LCS len ( A , B ) , where | S | denotes the length of the S sequence. The time complexity of the pairwise similarity calculation is: O (cid:18) | A || B | w (cid:19) , under reasonable assumption that w is comparable or smaller to the longer sequence length. s modern computers are equipped with multi-core processors, FAMSA distributes the calculation of LCS lengths fordifferent pairs of sequences to several computing threads. Additionally, presented software makes use of vector operationsprovided by technologies like SSE, AVX, AVX 2 which are supported by contemporary processors. This allows multiplepairs of sequences to be processed simultaneously by the same thread. Assuming t processing threads, a words in a single AVXvector (2 for AVX, 4 for AVX2, and 8 for announced AVX-512), and n being a sequence length, the total time complexity of theﬁrst stage can be expressed as: O (cid:18) n w × k ta (cid:19) . The utilisation of massively parallel architectures has become widespread in computationally demanding tasks. As FAMSAwas designed for analysing large protein families, it allows massively parallel devices like graphics processors to be employedfor calculation of pairwise similarities. The procedure is implemented in OpenCL, therefore it is suitable for GPUs produced byall main vendors including NVidia and AMD. Distributing LCS computation over thousands of graphics processor threadsfurther increases throughput of the ﬁrst FAMSA stage. Yet, as is shown in the experimental part of the article, even without theaid of OpenCL, FAMSA is able to process hundreds of thousands of proteins in very short time.

Determination of the guide tree

A number of algorithms for guide tree construction have been developed, e.g., NJ, UPGMA, single-linkage. FAMSAemploys the latter, which is motivated by following reasons: • it can be computed incrementally, i.e., without storing the complete similarity matrix, • it is very fast, i.e., can be completed in O ( k ) time using the SLINK algorithm, • it gave superior results in former studies. To beneﬁt from the incremental property of SLINK, ﬁrst two stages of FAMSA are performed simultaneously, which restrictsmemory footprint. Particularly, tree generation requires only O ( k ) space in contrast to O ( k ) needed by other guide treeconstruction algorithms like UPGMA. This is of crucial importance when huge protein families are investigated. Progressive construction of the alignment

Progressive construction stage requires O ( k ) proﬁle alignments, each computed with a use of dynamic programming. Atleast half of these alignments are degenerated cases in which one or both proﬁles consist of a single sequence. As dynamicprogramming implementation can be simpliﬁed in those cases, we prepared specialised variants of the general DP procedure.This gave remarkable computation time savings for huge datasets, in which due to the structure of a guide tree, the majority ofproﬁle alignments are made against a single sequence.Several improvements to the classical computation rule were introduced in FAMSA to increase alignment quality as well asthe processing speed. They were possible thanks to the internal proﬁle representation composed of three arrays storing: • occurrence counters of each alphabet symbol in consecutive columns (occupying 32 n ∗ computer words, with n ∗ beingthe proﬁle length), • costs of alignment of consecutive columns to each possible alphabet symbol (occupying also 32 n ∗ computer words), • sequences in the gapped representation .While two former components were previously employed by alignment algorithms, e.g., Kalign, the gapped representation is,to the best of our knowledge, a novel technique. In this representation, for each sequence, two equal-sized arrays are stored: ( i )sequence symbols, ( ii ) a number of gaps present before the corresponding sequence symbol. Moreover, to quickly localise asymbol in a column, as well as to insert or remove gaps, a dynamic position statistics are stored in an additional array. The spacefor the gapped sequence is approximately 13 times the length of the sequence (see Figure 1 for example). The proposed proﬁlerepresentation allows a dynamic programming matrix to be computed rapidly and is memory frugal. The DP computation stepfor a pair of proﬁles takes O ( n n σ ) time, where n , n are the input proﬁle lengths and σ is the alphabet size (equals 32 to represent twenty amino acids and severalspecial symbols).The presence of gapped representation is especially beneﬁcial when families containing tens of thousands proteins areinvestigated. Other aligners construct new proﬁle by copying sequences symbol by symbol with occasional gap insertions, C A H F Q G A C D L M F A P S2 0 2 1 3 1 0 0 2 0 0 4 0 1 1 04 5 6 2 4 6 3 39 8 10 617 1633 dps [ ] dps [ . . . ] dps [ . . . ] dps [ . . . ] no gaps [ . . . ] sequence [ . . . ] Figure 1.

Illustration of gapped sequence representation of – – C A – – H – F – – – Q – G A C – – D L M – – – – F A – P – S .The ‘+’ symbol is a guard present to simplify the implementation. The values of dps are computed according to the rule: dps [ i ] = dps [ i ] + dps [ i + ] , if the necessary cells are present. Otherwise they are calculated on the no gaps and sequence vectors. E.g., dps [ ] is the number of symbols in sequence [ . . . ] (equal 2) incremented by the number of gaps present justbefore these symbols, i.e., no gaps [ ] and no gaps [ ] .which starts to be a bottleneck for large-scale analyses. This is not the case in FAMSA in which whole sequences are movedfrom the input proﬁles to the new one and gaps are rapidly inserted by updating gap counters in corresponding arrays. The timeof construction of a new proﬁle is: O ( k o + n o σ + k ( n o − n ) log n + k ( n o − n ) log n ) , where k and k are number of sequences in both proﬁles, k o = k + k , and n o is the resulting proﬁle length. The overall timeof all proﬁle constructions is thus: O ( k + n f k σ + ( n f − n ) k log n ) = O ( k + n f k ( σ + log n )) , where n f is the ﬁnal proﬁle length.Adding the time of DP matrix calculation gives the total time of this stage: O ( kn σ + k + n f k ( σ + log n )) = O ( k + n k σ ) . As proﬁle alignments in the bottom part of the guide tree are independent, they can be performed in parallel. Therefore,to improve the computation time, FAMSA distributes proﬁle alignments over multiple threads. It would also be possibleto parallelise the dynamic programming computation and construction of a single proﬁle. We expect it to be particularlybeneﬁcial for families of million and more proteins. Nevertheless, we refrained from this in the current FAMSA version due toimplementation complications and lack of that large sets in existing databases.

Gap types and costs determination

Among numerous amino acid substitution matrices for dynamic programming calculation, we selected MIQS due to superiorresults reported in the recent study. The gap costs are determined according to the classic afﬁne penalty function, witha distinction between terminal and non-terminal gap open and gap extension costs, similarly to Kalign or MUSCLE. Particularly, four types of gaps are used: • gap terminal open ( T o )—opens a sequence at the left end or opens a contiguous series of gaps at the right end of asequence, • gap terminal extension ( T e )—extends a series of gaps inserted at the beginning or end of a sequence, • gap open ( G o )—opens a contiguous series of gaps inserted within a sequence, • gap extension ( G e )—extends a contiguous series of gaps inserted within a sequence.While determination of the number of gaps and their types is straightforward in pairwise alignment, it becomes problematicin MSA. Due to the fact that before aligning two proﬁles their sequences may have already contained gaps, the insertion of acolumn of gaps (either a single one or as the ﬁrst one in a contiguous series of columns with gaps) is not always equivalent tothe insertion of gap opens exclusively. Inserting only gap opens would result in an overestimation of their number. That is whytypes of gaps within a column should be corrected.In Figure 2 an exemplary alignment of two proﬁles X and Y is shown. A column of gaps is to be inserted into proﬁle X (leftpart of the ﬁgure). The proper types of gaps together with corrected gaps at the neighboring column are shown in the right partof the ﬁgure. While correcting gaps we consider the following situations: : C D E – T o T e T e S : T o T e T e – T e H I S : C D G o – G e H I S : C D E – G o G e I S : C D E – F H I S : C D T o – T e T e T e S : C G o G e Q F H I

XY S : C D E T o T e T e T e S : T o T e T e T e T e H I S : C D G o G e G e H I S : C D E G o G e G e I S : C D E G o F H I S : C D T o T e T e T e T e S : C G o G e Q F H I XY Figure 2.

Example of how gap columns are inserted during proﬁle alignment • S : there is a gap terminal open at the right side of the inserted one; hence, the inserted gap should be gap terminal open,and the following gap should be transformed into gap terminal extension, • S : there is a gap terminal extension at the left side of the inserted one; hence, the inserted gap should also begap terminal extension, • S : the inserted gap is to be placed into the gap series, so it should be gap extension, • S : there is a gap open at the right side of the inserted one, hence, the inserted gap should be gap open, and to preventthe occurrence of two gap opens one after the other, the second gap should turn into a gap extension, • S : the inserted gap is to be placed within the series of residues as the only gap, so it should be gap open, • S : there is a gap terminal open at the left side of the inserted one, hence, the inserted gap should be gap terminal exten-sion.Optimising gap parameters and recognising its inﬂuence on alignment accuracy is still the subject of intensive studies. Various techniques have been proposed, e.g., adding a bonus score to a gap cost to force to align distantly related sequences. Inour research all gap costs (i.e., gap opens and gap extensions, both terminal and non-terminal) are multiplied by a factor relatedto the number of sequences in the input collection. This prevents unnecessary widening of alignments of large collections. Thescaling factor is calculated as: g scale = + log ( k / g l ) g d , where g l and g d are two constants set by default to 45 and 7 (values chosen experimentally).The application of gap corrections and scaling leads to another modiﬁcation of the traditional approach. It is usuallyassumed that the insertion of a gap column to the ﬁrst proﬁle cannot be immediately followed by the insertion of a gap columnto the second proﬁle. Under some assumptions about the gap costs and substitution matrix values, it can be proved to bereasonable, i.e., such situation never leads to the optimal alignment. Nevertheless, this is not true if the gap correction is applied.Therefore, it is checked whether consecutive insertions of gap columns to both proﬁles render a higher-scored alignment . Iterative reﬁnement

The idea of an iterative reﬁnement is to correct misalignments made in the early phase of the proﬁle alignment. Severalalgorithms were proposed for this task, like REFINER or the methods implemented in MMSA, MSAProbs. In our recentpaper we investigated this problem showing that for sufﬁciently large collections of sequences the classical methods didnot work. We also proposed a column-oriented reﬁnement to improve the quality of alignments for collections up to 1000sequences. In this approach, the algorithm scans the proﬁle to localise columns that contain at least one gap. Then, it randomlyselects one of such columns and splits the proﬁle into two subproﬁles, depending on the gap presence in the selected column.Empty columns are removed afterwards and subproﬁles are realigned. Finally, if new alignment is scored higher than theoriginal one, it is accepted as the current solution.To simplify the time complexity analysis of reﬁnement, we assume the input and the output proﬁles to be of comparablelengths (which is usually the case). A single reﬁnement iterations requires then O ( kn f + n σ + k ( n f − n ∗ ) log n ) The proﬁle alignment score is the summed alignment score of each sequence pair. ime, with n ∗ being the length of the shorter of the two proﬁles obtained after splitting the original proﬁle.Preliminary analyses showed reﬁnement to be particularly beneﬁcial for smaller sets of sequences. Due to this reason andto improve the processing time of large protein families, the reﬁnement is applied only for k ≤ Results

Benchmark selection

An assessment of MSA algorithms was performed using benchmark datasets. The presence of high-quality, manually curatedreference alignments allowed supervised accuracy measures to be calculated. Those were sum-of-pairs (SP) and total-column(TC) scores deﬁned as fractions of correctly aligned symbol pairs and columns, respectively.Our aim was to propose an efﬁcient and robust algorithm for the alignment of thousands of proteins. The largest availablebenchmarks contain sets of at most hundreds of sequences, with an exception of HomFam introduced by Sievers et al. HomFam consists of 92 families constructed by extending Homstrad reference alignments (only those having 5 or moresequences were taken into account) with corresponding families from PFam database. This protocol results in large benchmarksets: 18 of them consists of more than 10 000 members (the number of reference sequences ranges from a few to a few tens,though).Since 2011, when HomFam was introduced, PFam and Homstrad databases have grown signiﬁcantly. Therefore, to carryout more extensive experiments, we present a new benchmark named extHomFam. It was constructed according to the HomFamgeneration protocol with several modiﬁcations. 399 high-quality alignments containing at least 3 proteins were selected fromHomstrad (ver. 1 Apr 2015). By decreasing the threshold from 5, we aimed at obtaining a larger benchmark than originalHomFam. Taking into account also two-protein families would increase extHomFam to 1 013 sets, at the cost of positivelybiasing TC score (with two reference sequences it becomes equal to SP), therefore pairwise-only alignments were excludedfrom the consideration. After that, selected Homstrad sets were enriched with corresponding PFam (ver. 28) families. Afterremoving duplicated sequences, sets of less than 200 proteins were ﬁltered out giving ﬁnal benchmark of 380 families. Forconvenience extHomFam was divided at thresholds k = small , medium , large ,and extra-large . Note, that sets of ∼ ABC tran , the most numerous set in extHomfam contains 415 519 sequences, which is the largestbenchmark protein family available.The scalability of algorithms was evaluated on 53 largest extHomFam families containing at least 30 000 sequences each.These sets were recursively downsampled to desired sizes with a guarantee of preserving sequences from reference alignments.This scheme has a valuable property of smaller sets being contained in larger ones which reduces results variability.To evaluate the performance of the presented algorithm on smaller alignment problems, classic benchmarks, i.e., BAl-iBASE, PREFAB, OXBench-X, and SABmark, were also considered in the experiments. Competitive algorithms and system setup

From among numerous sequence alignment algorithms, only those able to handle families of thousands of sequences wereinvestigated on HomFam and extHomFam. Those were MUSCLE, Kalign 2, Kalign-LCS, Clustal Omega, and MAFFT. The latter was analysed in default conﬁguration in which it calculates O ( k ) pairwise similarities, as well as -parttree and-dpparttree modes especially suited for large sets of sequences due to lower computational requirements. Clustal Omega wasexecuted with default parameters and with two combined iterations (-iter2) which was shown to give superior results in theprevious studies. MUSCLE in default mode was unfeasible for immense protein families, hence -maxiters2 variant wasalso considered. Details on execution parameters and program versions are given in the Supplementary material.The experiments on smaller benchmarks (BAliBASE, PREFAB, OXBench-X, SABmark) concerned also top consistency-based algorithms: MSAProbs, QuickProbs, and GLProbs. For the experiments we used the workstation equipped with two 12-core Intel Xeon E5-2670v3 processors (clocked at2.3 GHz), Nvidia Quadro M6000 graphic card (3072 cores clocked at 1.0 GHz), and 128 GB RAM. To investigate the behaviorof the algorithms on modern workstations and servers containing from a few to several tens of cores, all methods were run with8 computing threads, unless stated otherwise. FAMSA was run in the CPU mode, except for the experiment on the algorithmscalability w.r.t. the number of CPU cores, where GPU variant was additionally investigated.

HomFam and extHomFam benchmark evaluation

Following Sievers et al. , HomFam was divided into three parts depending on the family size. As Table 1 shows, for k ≤ < k ≤

10 000,FAMSA became the ﬁrst and the second in terms of SP and TC score, respectively. When the last subgroup was investigated,presented algorithm took the lead on both measures revealing its potential for large protein families. Importantly, FAMSA was able 1.

Comparison of algorithms on HomFam dataset.

Algorithm 93 ≤ k ≤ < k ≤ < k ≤

30 86.9

Clustal-iter2 from several to hundreds times faster than competitors. E.g., it processed entire HomFam in 12 minutes while Clustal-defaultand MAFFT-default required, respectively, 8h40m and 2h30m. Even larger difference was observed for Clustal-iter2 whichcompleted analyses in 51 hours. MAFFT-parttree and -dpparttree were also inferior to FAMSA, which is especially noteworthyas they calculate only selected pairwise similarities (usually O ( k log k ) ) instead of full matrix ( O ( k ) ).The experiments on extHomFam conﬁrmed superior accuracy and execution time of FAMSA to scale well with the numberof sequences (Figure 3; more detailed results are given in Supplementary material). FAMSA was inferior to Clustal-iter2 by asmall margin only on small subset. For 4 000 < k ≤

25 000 it became the best aligner and, depending on the subset and qualitymeasure, was followed by Clustal, MAFFT, Kalign2, or Kalign-LCS. MUSCLE, as well as fast MAFFT variants renderedinferior results (MAFFT-parttree was excluded from Figure 3 due to signiﬁcantly worse accuracy than that of -dpparttree). On extra-large

FAMSA held the lead while MAFFT-dpparttree became the second best algorithm. Kalign2, Kalign-LCS, andMUSCLE did not complete the analyses due to excessive memory or time requirements. Clustal Omega and MAFFT-defaultfailed to process, respectively, one and four largest extHomFam families (missing MAFFT results were taken from -dpparttreevariant, though). Advances in SP and TC measures of FAMSA over competing software on medium , large , and extra-large subsets were assessed statistically with a use of Wilcoxon signed-rank test with Bonferroni-Holm correction for multiple testing.The differences are signiﬁcant at α = . p -values for all pairwise comparisons can be found in Table 2. Figure 3.

Comparison of alignment software on extHomFam. The solid bars (lower) represent TC scores, while thetransparent ones (higher)—SP scores. For each subset algorithms were sorted increasingly according to TC measure. Executiontimes are given above bars in hours:minutes format.As Figure 3 shows, the quality advance of presented software over other algorithms increases for consecutive subsets. Forinstance, on extra-large , FAMSA properly aligned 35% and 25% more columns than the most accurate variants of MAFFT andClustal Omega. More detailed analysis of FAMSA accuracy compared to competitors is given in Figure 4. Four extHomFamcategories were further divided into 11 subsets having approximately 35 families. For each interval at k axis, we plotted selected able 2. Statistical signiﬁcance of FAMSA advances over selected competitors measured using Wilcoxon signed-rank testwith Bonferroni-Holm correction for multiple testing on extHomFam subsets.

Algorithm medium large extra-largeSP TC SP TC SP TCClustal-default 0.00003 0.00012 0.00011 0.00081 < − < − Clustal-iter2 0.00971 0.00940 0.00502 0.01383 0.00065 0.00137MAFFT-default < − < − < − < − < − MAFFT-dpparttree < − < − < − < − < − < − Kalign-LCS < − < − < − < − — — statistical indicators (median, mean, 15th and 85th percentile) of absolute differences in SP and TC measures between FAMSAand other algorithms. Clearly, the number of test cases for which presented software is superior to the competitors, as well asthe absolute dominance in quality, increases with growing set size. This observation is supported by the scalability analysisperformed on 53 largest families ( k ≥

30 000) randomly resampled to obtain less numerous sets. Figure 5 shows FAMSA tooutrun Clustal-default in SP and TC scores when number of sequences exceeds 7500. Clustal-iter2 graphs are crossed for largersets of sequences, i.e., k ≥

12 500.

Figure 4.

Absolute differences in SP (red) and TC (blue) scores between FAMSA and competing software for extHomFamsubsets. Each interval at the horizontal axis contains approximately 35 families. Solid lines represent medians, dashed linesindicate 15th and 85th percentiles (thus, ﬁlled areas contain 70% of observations). Means are additionally given by circular(SP) and cross (TC) markers.The abundance of extremely large protein families makes extHomFam the most demanding benchmark in terms ofcomputational resources. Beside FAMSA, only MAFFT-dpparttree and -parttree were able to process all its sequences sets.Other algorithms either crashed due to memory requirements or were terminated by us when processing time of a familyexceeded 24 hours . While Clustal-default, MAFFT-default, and MAFFT-dpparttree required, respectively, 188, 78, and 50hours, FAMSA ﬁnished computations in approximately 7 hours and 40 minutes, which corresponds to 25-fold, 10-fold, and6-fold advance. The extreme case was Clustal-iter2 which needed almost 1000 hours showing its combined iterations to beinapplicable for very large protein families. More detailed analysis of computational scalability of presented algorithm is givenin Figure 5. It conﬁrms FAMSA to be faster than MAFFT-default and Clustal Omega by approximately order and two orders ofmagnitude (depending on the parameters of the latter). The efﬁciency of presented algorithm is thanks to the fast bit-parallelsimilarity computation and the in-place proﬁle joining. Yet, as FAMSA calculates more distances than Clustal Omega andMAFFT-dpparttree ( O ( k ) instead of O ( k log k ) ), one can expect it to exceed competitor execution times for sufﬁciently large k .To verify this we compared algorithms on ABC tran , the largest family in extHomFam with 415 519 proteins. FAMSA processedthis set in less then 2 hours. Clustal Omega crashed due to excessive memory requirements after 55 hours of calculationsstrongly suggesting that the algorithm is dominated by stages other than similarity computation. The different situation was inthe case of MAFFT-dpparttree which execution time scaled better with the number of sequences, though was still inferior to An exception was made only for Clustal-iter2 due to its superior quality results. igure 5.

Scalability of SP (dashed lines) and TC (solid lines) scores with respect to the number of sequences. Experimentswere performed on 53 largest extHomFam families randomly resampled to obtain desired set size. Processing times for selectedvalues of k are given as bar plots.FAMSA by a factor of 2.5. Importantly, FAMSA required below 8 GB of RAM, while MAFFT-dpparttree allocated 47 GB. Tocompare, MAFFT-default and Clustal Omega failed to run on 128 GB machine (the former demanded 318 GB just for storingthe similarity matrix). Concluding, the calculation of all pairwise similarities performed by FAMSA did not prevent it frombeing the fastest and most memory efﬁcient aligner in the comparison even for immense protein families.As FAMSA was designed to fully utilise available computational power, it takes advantage of multi-core architectures ofnowadays computers. Ten largest protein families from extHomFam (all that contain at least 100 000 sequences: ABC tran , gtp , HATPase c , helicase NC , kinase , mdd , response reg , rvp , sdr , TyrKc ) were selected to investigate scalability of the algorithmstages with respect to the number of computing threads. In the experiments we also considered the variant of FAMSA in whichsimilarity calculation was suited for massively parallel architectures with a use of OpenCL. For convenience, processing timesof

ABC tran were marked separately. As Figure 6 shows, when FAMSA was run serially, more than 90% of the executiontime was related to stages I and II (the algorithm performs them simultaneously). Nevertheless, as pairwise similarities can becalculated independently, these stages scale with the number of threads noticeably better than the progressive construction.Particularly, when more than 12 cores were involved, stage III of the algorithm started to be the bottleneck. This was also thecase for the GPU FAMSA variant.

Classic benchmark evaluation

For completeness, the accuracy of algorithms was investigated on classic benchmarks with families ranging from a few toapproximately a hundred of sequences (Table 3). According to the expectations, consistency-based methods (QuickProbs 2,MSAProbs, and GLProbs) are superior to the competitors. When non-consistency approaches are of interest, FAMSA is thesecond best algorithm on BAliBASE and PREFAB, and the third on OXBench-X. Interestingly, it takes the lead on SABmark.The analysis of execution times conﬁrms FAMSA to be one of the fastest algorithms for low and moderately-sized sets as thosecontained in investigated benchmarks.

Discussion

The abundance of protein families containing hundreds of thousands members imposes development of algorithms computa-tionally feasible to align immense sets of sequences. Traditional progressive scheme was successfully modiﬁed by ClustalOmega and MAFFT aligners to eliminate its greatest bottleneck in large-scale analyses—calculation of all pairwise similarities.Nevertheless, experiments with FAMSA show, that computation of entire similarity matrix with the use of LCS measurecombined with memory-efﬁcient single-linkage tree construction and in-place proﬁle alignment, is orders of magnitude fasterthan competing solutions. Importantly, this comes with superior alignment quality—FAMSA was signiﬁcantly more accuratethan Clustal Omega and MAFFT on sets of a few thousands and more sequences.

ABC tran , the largest from investigatedfamilies with 415 519 sequences reveals the potential of presented software. The set was processed by FAMSA within less igure 6.

Computational scalability of FAMSA with respect to the number of cores evaluated on ten largest extHomFamfamilies ( k ≥

100 000). Algorithm stages are represented by different colours. Execution times of the largest set (

ABC tran ) aremarked with solid ﬁll, the other families are printed in with transparency.than 2 hours in less than 8 GB of RAM, which is suitable for a typical laptop computer. In contrast, Clustal Omega crashedafter 2 days of computations on 128 GB machine due to excessive memory requirements. MAFFT in memory-efﬁcient modecompleted the analysis in 5 hours allocating 47 GB of RAM, yet it successfully aligned only 5.7% of columns, while FAMSArestored 21.3%.The scalability of presented algorithm in terms of alignment quality as well as time and memory requirements, makesit applicable for protein families even of a million sequences—the no-go area for competing software. Such families willlikely be present in PFam database in the near future, as a consequence of advances in sequencing technologies. Importantly,the efﬁciency of FAMSA has the potential to be further increased. The natural option is the parallelisation of the dynamicprogramming procedure at the proﬁle construction stage, as it appeared to be a bottleneck in the scalability tests. Anotherpossibility could be better utilisation of massively parallel architectures by optimising OpenCL code for GPUs or adapting it forIntel Xeon Phi co-processors.An alternative development direction concerns alignment quality. Iterative reﬁnement is one of numerous techniques

Table 3.

Comparison of algorithms for small datasets.

Algorithm BAliBASE PREFAB OXBench-X SABmarkSP TC time SP/TC time SP TC time SP TC timeQuickProbs 2 88.0 61.7 23:41 74.2 1:41:26 89.5 80.3 1:35:35 61.1 40.8 24MSAProbs 87.8 60.8 35:29 73.7 2:26:39 89.1 80.0 2:42:09 60.2 40.0 29GLProbs 87.9 59.3 23:21 72.4 1:25:40 89.1 80.0 1:10:08 61.4 41.4 3:55MAFFT auto 86.5 58.7 10:20 72.6 17:28 88.7 79.4 7:13 57.3 36.8 1:01Clustal-iter2 84.8 56.7 67:32 71.0 2:35:46 88.5 79.5 45:30 55.2 35.7 2:52Clustal-default 84.2 55.9 7:41 70.0 21:56 87.8 78.1 7:34 55.0 35.5 3:52FAMSA 83.6 53.5 2:01 68.1 5:23 87.3 77.2 1:09 56.9 37.6 19Kalign-LCS 83.0 50.4 29 65.9 1:51 86.8 76.4 36 55.6 35.6 2Muscle 81.9 47.8 14:10 67.7 35:04 87.5 77.6 26:44 54.5 33.5 45MAFFT default 81.7 47.5 1:48 68.0 5:58 86.6 76.2 1:39 53.2 33.0 39Kalign2 81.1 47.1 37 65.5 2:03 86.3 75.9 48 52.4 32.6 2 esigned for accuracy improvement. Due to computational reasons, it is performed by FAMSA on families of less than 1000sequences, though. One can consider applying some limited, less time-consuming reﬁnement scheme also for larger setsof sequences. The different ideas include introducing proﬁle Markov models or consistency. Until recently the latter wasfound infeasible for large families because of excessive computational requirements. However, our latest research showedthat applying consistency only on a small, carefully selected fraction of sequences, may elevate alignment quality withoutcompromising execution time. The experiments concerned sets up to thousand of sequences, accordingly the scalability ofpresented ideas to families two orders of magnitude larger is an open question. Moreover, designing consistency schemesuitable for FAMSA is a non-trivial task.The separate issue related to large-scale analyses is an accuracy assessment, particularly the lack of reference sequences.Evaluating quality of alignment of 10 000 or more proteins on the basis of a reference containing only small fraction of membersis the largest ﬂaw of the experimental pipeline used in the current research. We believe that advances in multiple alignmentdomain should be facilitated with the development of new benchmark datasets containing more reference sequences.FAMSA executables together with source code are available at https://github.com/refresh-bio/FAMSA , extHomFam canbe downloaded from http://dx.doi.org/10.7910/DVN/BO2SVW . Web service for remote analyses is under development. References G Blackshields, F Sievers, W Shi, A Wilm, and D Higgins (2010) Sequence embedding for fast construction of guide treesfor multiple sequence alignment.

Algorithm Mol Biol K Boyce, F Sievers and DG Higgins (2014) Simple chained guide trees give high-quality protein multiple sequencealignments.

Proc Nat Acad Sci USA K Boyce, F Sievers and DG Higgins (2015) Reply to Tan et al.: Differences between real and simulated proteins in multiplesequence alignments.

Proc Nat Acad Sci USA K Boyce, F Sievers and DG Higgins (2015) Instability in progressive multiple sequence alignment algorithms.

AlgorithmMol Biol 10 S Chakrabarti, CJ Lanczycki, AR Panchenko, TM Przytycka, PA Thiessen and SH Bryant (2006) Reﬁning multiplesequence alignments with conserved core regions.

Nucleic Acids Res M Chatzou et al. (2015) Multiple sequence alignment modeling: methods and applications.

Brief Bioinform doi:10.1093/bib/bbv099. S Deorowicz, A Debudaj-Grabysz and A Gudy´s (2014)

Kalign-LCS—A More Accurate and Faster Variant of Kalign2Algorithm for the Multiple Sequence Alignment Problem , Chapter in book Man-Machine Interactions 3, Series Advances inIntelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, (A Gruca, T Czach´orski, S Kozielski, Editors) 242,pp 495–502. ChB Do, MSP Mahabhashyam, M Brudno, and S Batzoglou (2005) ProbCons: Probabilistic consistency-based multiplesequence alignment.

Genome Res , 15(2):330–340. RC Edgar (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

BMCBioinformatics

RC Edgar (2009) Optimizing substitution matrix choice and gap parameters for sequence alignment

BMC Bioinformatics

K Florek, J Łukaszewicz, J Perkal, H Steinhaus and S Zubrzycki (1951) Sur la liaison et la division des points d’unensemble ﬁni.

Colloquium Mathematicae

D Gusﬁeld (1997) Algorithms on Strings, Trees and Sequences. Cambridge University Press.

A Gudy´s and S Deorowicz (2014) QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for GraphicsProcessors.

PLoS One

A Gudy´s and S Deorowicz (2015)

QuickProbs 2: towards rapid construction of high-quality alignments of large proteinfamilies . Preprint available at: http://arxiv.org/abs/1512.07437.

H Hyyr¨o (2004)

Bit-parallel LCS-length computation revisited

In Proceedings of the 15th Australian Workshop onCombinatorial Algorithms (AWOCA 2004), pp 16–27.

Intel Corporation (2015) Intel 64 and IA-32 Architectures Software Developer’s Manual. Combined Volumes: 1, 2A, 2B,2C, 3A, 3B, 3C and 3D. K Katoh and H Toh (2007) PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences.

Bioinformatics

K Katoh and H Toh (2008) Recent developments in the MAFFT multiple sequence alignment program.

Brief Bioinform

K Katoh and DM Standley (2013) MAFFT multiple sequence alignment software version 7: improvements in performanceand usability.

Mol Biol Evol , 30:772–780.

T Lassmann and ELL Sonnhammer (2005) Kalign—an accurate and fast multiple sequence alignment algorithm.

BMCBioinformatics

T Lassmann, O Frings and ELL Sonnhammer (2009) Kalign2: high-performance multiple alignment of protein andnucleotide sequences allowing external features.

Nucleic Acids Res

Y Liu, B Schmidt and D.L Maskell (2010) MSAProbs: multiple sequence alignment based on pair hidden Markov modelsand partition function posterior probabilities.

Bioinformatics

K Mizuguchi, CM Deane, TL Blundell, and JP Overington (1998) HOMSTRAD: a database of protein structure alignmentsfor homologous families.

Protein Sci

R Muth and U Manber (1996) Approximate multiple string search. In

Proceedings of the 7th Annual Symposium onCombinatorial Pattern Matching , pp 75–86.

C Notredame, DG Higgins, and J Heringa (2000) T-Coffee: A novel method for fast and accurate multiple sequencealignment.

J Mol Biol

I Plyusnin and L Holm (2012) Comprehensive comparison of graph based multiple protein sequence alignment strategies.

BMC Bioinformatics

G Raghava, G Searle, P Audley, J Barber, and G Barton (2003) OXBench: A benchmark for evaluation of protein multiplesequence alignment accuracy.

BMC Bioinformatics , 4(1): 47.

M Punta et al. (2012) The Pfam protein families database.

Nucleic Acids Res

N Saitou and M Nei (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Mol BiolEvol

R Sibson (1973) SLINK: An optimally efﬁcient algorithm for the single-link cluster method.

The Computer Journal

F Sievers, D Dinnen, A Wilm and DG Higgins (2013) Making automated multiple alignments of very large numbers ofprotein sequences.

Bioinformatics

F Sievers, et al. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Mol Syst Biol

R.R Sokal and C.D Michener (1958) A statistical method for evaluating systematic relationships.

Univ Kans Sci Bull

G Tan, M Gil, AP L oytynoja, N Goldman and C Dessimoz (2015) Simple chained guide trees give poorer multiplesequence alignments than inferred trees in simulation and phylogenetic benchmarks.

Proc Nat Acad Sci USA

JD Thompson, DG Higgins, and TJ Gibson (1994) CLUSTAL W: improving the sensitivity of progressive multiplesequence alignment through sequence weighting, position-speciﬁc gap penalties and weight matrix choice.

Nucleic AcidsRes , 22(22):4673–4680.

JD Thompson, F Plewniak, O Poch (1999) BAliBASE: a benchmark alignment database for the evaluation of multiplealignment programs.

Bioinformatics , 15(1):87–88.

I Walle, I Lasters, L Wyns (2005) SABmark—a benchmark for sequence alignment that covers the entire known fold space.

Bioinformatics , 21(7):1267–1268.

S Wu and U Manber (1992) Fast text searching: allowing errors.

Communications of the ACM

K Yamada and K Tomii (2014) Revisiting amino acid substitution matrices for identifying distantly related proteins.

Bioinformatics

Y Ye, et al. (2015) GLProbs: Aligning Multiple Sequences Adaptively.

IEEE/ACM Trans Comput Biol Bioinf cknowledgements

The work was supported by Polish National Science Centre under the projects DEC-2011/03/B/ST6/01588 and DEC-2015/17-/B/ST6/01890 and by Silesian University of Technology under the project BK-263/RAu2/2015, performed using the infras-tructure supported by POIG.02.03.01-24-099/13 grant: ‘GeCONiI—Upper Silesian Center for Computational Science andEngineering’.

Author contributions statement

SD and ADG designed the main part of the algorithm. AG and SD designed the GPU part of the algorithm. SD, ADG, and AGprepared the implementation. AG and SD designed and prepared the new benchmark. SD performed the experiments. SD, AG,and ADG drafted the manuscript and supplementary material. All authors read and approved the ﬁnal manuscript.

Additional information

The supplementary material contains details on how the data were prepared and how the experiments were performed.

Competing ﬁnancial interests

The authors declare no competing ﬁnancial interests. upplementary material xamined programs

The following programs were used in the experimental part. The running parameters are also given. • Clustal Omega v. 1.2.0 – -i < input > -o < output > --threads=8 – -i < input > -o < output > --threads=8 --iter=2 • FAMSA v. 1.0 – -t 8 < input > < output > • GLProbs v. 1.0 – -num threads < input > -o < output > • Kalign v. 2.04 – -quiet -i < input > -o < output > • Kalign-LCS v. 2.04 – -quiet -b upgma -d lcs indel -i < input > -o < output > • MAFFT v. 7.221 – auto: --auto --quiet --thread 8 --anysymbol < input > – default: --quiet --thread 8 --anysymbol < input > – parttree: --quiet --thread 8 --anysymbol --parttree < input > – dpparttree: --quiet --thread 8 --anysymbol --dpparttree < input > • MSAProbs v. 0.9.7 – -num threads < input > -o < output > • MUSCLE v. 3.8.31 – default: -quiet -in < input > -out < output > – maxiters2: -quiet -in < input > -out < output > , -maxiters 2 • QuickProbs v. 2 – -t 8 < input > -o < output > Additional results a b l e4 . C o m p a r i s ono f a l go r it h m s f o r E x t H o m F a m d a t a s e t s . T i m e s a r e g i v e n i nhou r s : m i nu t e s : s ec ond s f o r m a t . A l go r it h m S m a ll M e d i u m L a r g e E x t r a l a r g e A ll < k ≤ < k ≤ < k < ≤ < k ≤ f a m ili e s f a m ili e s f a m ili e s f a m ili e s f a m ili e s SP T C ti m e SP T C ti m e SP T C ti m e SP T C ti m e SP T C ti m e F A M S A - op t . . : . . : . . : . . : : . . : : C l u s t a l O m e g a – it e r . . : : . . : : . . : : . . : : . . : : C l u s t a l O m e g a . . : . . : : . . : : . . : : . . : : K a li gn - L C S . . : . . : : . . : : —————— K a li gn276 . . : . . : : . . : : —————— M U S C LE m a x it e r s . . : . . : : . . : : —————— M A FF T d e f a u lt . . : . . : : . . : : . . : : . . : : M A FF T p a r tt r ee . . : . . : . . : : . . : : . . : : M A FF T dpp a r tt r ee . . : . . : : . . : : . . : : . . : :26