Andreas Sand
Aarhus University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Andreas Sand.
symposium on discrete algorithms | 2013
Gerth Stølting Brodal; Rolf Fagerberg; Thomas Mailund; Christian N. S. Pedersen; Andreas Sand
The triplet and quartet distances are distance measures to compare two rooted and two unrooted trees, respectively. The leaves of the two trees should have the same set of n labels. The distances are defined by enumerating all subsets of three labels (triplets) and four labels (quartets), respectively, and counting how often the induced topologies in the two input trees are different. In this paper we present efficient algorithms for computing these distances. We show how to compute the triplet distance in time O(n log n) and the quartet distance in time O(dn log n), where d is the maximal degree of any node in the two trees. Within the same time bounds, our framework also allows us to compute the parameterized triplet and quartet distances, where a parameter is introduced to weight resolved (binary) topologies against unresolved (non-binary) topologies. The previous best algorithm for computing the triplet and parameterized triplet distances have O(n2) running time, while the previous best algorithms for computing the quartet distance include an O(d9n log n) time algorithm and an O(n2.688) time algorithm, where the latter can also compute the parameterized quartet distance. Since d ≤ n, our algorithms improve on all these algorithms.The papers in this volume were presented at the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, held January 6-8, 2013 in New Orleans, Louisiana, USA. The symposium was jointly sponsored by SIGACT (the ACM Special Interest Group on Algorithms and Computation Theory) and by the SIAM Activity Group on Discrete Mathematics.
BMC Bioinformatics | 2013
Andreas Sand; Martin Kristiansen; Christian N. S. Pedersen; Thomas Mailund
BackgroundHidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library.ResultsWe have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.ConclusionsWe have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.
ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011
Jesper Buus Nielsen; Andreas Sand
Two of the most important algorithms for Hidden Markov Models are the forward and the Viterbi algorithms. We show how formulating these using linear algebra naturally lends itself to parallelization. Although the obtained algorithms are slow for Hidden Markov Models with large state spaces, they require very little communication between processors, and are fast in practice on models with a small state space. We have tested our implementation against two other implementations on artificial data and observe a speed-up of roughly a factor of 5 for the forward algorithm and more than 6 for the Viterbi algorithm. We also tested our algorithm in the Coalescent Hidden Markov Model framework, where it gave a significant speed-up.
Biology | 2013
Andreas Sand; Morten Kragelund Holt; Jens Johansen; Rolf Fagerberg; Gerth Stølting Brodal; Christian N. S. Pedersen; Thomas Mailund
Distance measures between trees are useful for comparing trees in a systematic manner, and several different distance measures have been proposed. The triplet and quartet distances, for rooted and unrooted trees, respectively, are defined as the number of subsets of three or four leaves, respectively, where the topologies of the induced subtrees differ. These distances can trivially be computed by explicitly enumerating all sets of three or four leaves and testing if the topologies are different, but this leads to time complexities at least of the order n3 or n4 just for enumerating the sets. The different topologies can be counted implicitly, however, and in this paper, we review a series of algorithmic improvements that have been used during the last decade to develop more efficient algorithms by exploiting two different strategies for this; one based on dynamic programming and another based on coloring leaves in one tree and updating a hierarchical decomposition of the other.
Bioinformatics | 2014
Andreas Sand; Morten Kragelund Holt; Jens Johansen; Gerth Stølting Brodal; Thomas Mailund; Christian N. S. Pedersen
UNLABELLED tqDist is a software package for computing the triplet and quartet distances between general rooted or unrooted trees, respectively. The program is based on algorithms with running time [Formula: see text] for the triplet distance calculation and [Formula: see text] for the quartet distance calculation, where n is the number of leaves in the trees and d is the degree of the tree with minimum degree. These are currently the fastest algorithms both in theory and in practice. AVAILABILITY AND IMPLEMENTATION tqDist can be installed on Windows, Linux and Mac OS X. Doing this will install a set of command-line tools together with a Python module and an R package for scripting in Python or R. The software package is freely available under the GNU LGPL licence at http://birc.au.dk/software/tqDist.
asia-pacific bioinformatics conference | 2013
Andreas Sand; Gerth Stølting Brodal; Rolf Fagerberg; Christian N. S. Pedersen; Thomas Mailund
The triplet distance is a distance measure that compares two rooted trees on the same set of leaves by enumerating all sub-sets of three leaves and counting how often the induced topologies of the tree are equal or different. We present an algorithm that computes the triplet distance between two rooted binary trees in time O (n log2n). The algorithm is related to an algorithm for computing the quartet distance between two unrooted binary trees in time O (n log n). While the quartet distance algorithm has a very severe overhead in the asymptotic time complexity that makes it impractical compared to O (n2) time algorithms, we show through experiments that the triplet distance algorithm can be implemented to give a competitive wall-time running time.
Journal of Theoretical Biology | 2013
Andreas Sand; Mike Steel
Evolutionary events such as incomplete lineage sorting and lateral gene transfers constitute major problems for inferring species trees from gene trees, as they can sometimes lead to gene trees which conflict with the underlying species tree. One particularly simple and efficient way to infer species trees from gene trees under such conditions is to combine three-taxon analyses for several genes using a majority vote approach. For incomplete lineage sorting this method is known to be statistically consistent; however, for lateral gene transfers it was recently shown that a zone of inconsistency exists for a specific four-taxon tree topology, and it was posed as an open question whether inconsistencies could exist for other four-taxon tree topologies? In this letter we analyze all remaining four-taxon topologies and show that no other inconsistencies exist.
Biology | 2013
Paula Tataru; Andreas Sand; Asger Hobolth; Thomas Mailund; Christian N. S. Pedersen
Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.
2010 Ninth International Workshop on Parallel and Distributed Methods in Verification, and Second International Workshop on High Performance Computational Systems Biology | 2010
Andreas Sand; Christian N. S. Pedersen; Thomas Mailund; Asbjorn Tolbol Brask
Molecular Phylogenetics and Evolution | 2013
Dietrich Radel; Andreas Sand; Mike Steel