Veli Mäkinen
University of Helsinki
Publication
Featured research published by Veli Mäkinen.
ACM Computing Surveys | 2007
Gonzalo Navarro; Veli Mäkinen
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area.
ACM Transactions on Algorithms | 2007
Paolo Ferragina; Giovanni Manzini; Veli Mäkinen; Gonzalo Navarro
Given a sequence <i>S</i> = <i>s</i><sub>1</sub><i>s</i><sub>2</sub>…<i>s</i><sub><i>n</i></sub> of integers smaller than <i>r</i> = <i>O</i>(polylog(<i>n</i>)), we show how <i>S</i> can be represented using <i>nH</i><sub>0</sub>(<i>S</i>) + <i>o</i>(<i>n</i>) bits, so that we can know any <i>s</i><sub><i>q</i></sub>, as well as answer <i>rank</i> and <i>select</i> queries on <i>S</i>, in constant time. <i>H</i><sub>0</sub>(<i>S</i>) is the zero-order empirical entropy of <i>S</i> and <i>nH</i><sub>0</sub>(<i>S</i>) provides an information-theoretic lower bound to the bit storage of any sequence <i>S</i> via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in <i>O</i>(log <i>r</i>) time. For larger <i>r</i>, we can still represent <i>S</i> in <i>nH</i><sub>0</sub>(<i>S</i>) + <i>o</i>(<i>n</i> log <i>r</i>) bits and answer queries in <i>O</i>(log <i>r</i>/log log <i>n</i>) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design <i>compressed full-text indexes</i> that scale well with the size of the input alphabet Σ. Specifically, we design a variant of the FM-index that indexes a string <i>T</i>[1, <i>n</i>] within <i>nH</i><sub><i>k</i></sub>(<i>T</i>) + <i>o</i>(<i>n</i>) bits of storage, where <i>H</i><sub><i>k</i></sub>(<i>T</i>) is the <i>k</i>th-order empirical entropy of <i>T</i>. This space bound holds simultaneously for all <i>k</i> ≤ α log<sub>|Σ|</sub> <i>n</i>, constant 0 < α < 1, and |Σ| = <i>O</i>(polylog(<i>n</i>)). 
This index counts the occurrences of an arbitrary pattern <i>P</i>[1, <i>p</i>] as a substring of <i>T</i> in <i>O</i>(<i>p</i>) time; it locates each pattern occurrence in <i>O</i>(log<sup>1+ϵ</sup> <i>n</i>) time for any constant 0 < ϵ < 1; and reports a text substring of length ℓ in <i>O</i>(ℓ + log<sup>1+ϵ</sup> <i>n</i>) time. Compared to all previous works, our index is the first that removes the alphabet-size dependence from all query times; in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space as the <i>k</i>th-order entropy of the text <i>T</i>, which is the best space obtained in previous work. We can also handle larger alphabets of size |Σ| = <i>O</i>(<i>n</i><sup>β</sup>), for any 0 < β < 1, by paying <i>o</i>(<i>n</i> log|Σ|) extra space and multiplying all query times by <i>O</i>(log |Σ|/log log <i>n</i>).
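As a small illustration of the space bound in the abstract above, the zero-order empirical entropy H0(S) can be computed directly from symbol frequencies. This is only a sketch of the definition, not of the paper's data structure; the function name `zero_order_entropy` and the sample sequence are assumptions made for the example.

```python
from collections import Counter
from math import log2


def zero_order_entropy(seq):
    """Return the zero-order empirical entropy H0(seq) in bits per symbol.

    H0 = sum over distinct symbols c of (n_c / n) * log2(n / n_c),
    where n_c is the number of occurrences of c and n = len(seq).
    """
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return sum((c / n) * log2(n / c) for c in counts.values())


# Hypothetical sample sequence of small integers.
sample = [0, 1, 1, 2, 1, 0, 1, 1]
h0 = zero_order_entropy(sample)
print(f"H0 = {h0:.3f} bits/symbol, n*H0 = {len(sample) * h0:.2f} bits")
```

The quantity n·H0(S) printed here is the leading term of the space bound; the o(n) term covers the auxiliary structures that the abstract says support constant-time access, rank, and select.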
Journal of Computational Biology | 2010
Veli Mäkinen; Gonzalo Navarro; Jouni Sirén; Niko Välimäki
A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example is the genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are capable of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.
Theoretical Computer Science | 2007
Veli Mäkinen; Gonzalo Navarro
The deep connection between the Burrows-Wheeler transform (BWT) and the so-called rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time trade-offs for the problem. An application of these queries is in position-restricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle's two-dimensional data structures, and Grossi et al.'s wavelet trees.
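The rank and select queries defined in the abstract above can be made concrete with a deliberately naive sketch. It runs in linear time and uses no compression, so it illustrates only the semantics of the two queries, not the constant-time gap-encoded structures of the paper. The function names and the 0-based/1-based conventions are assumptions chosen for this example.

```python
def rank(seq, symbol, i):
    """Number of occurrences of `symbol` in the prefix seq[0:i]."""
    return seq[:i].count(symbol)


def select(seq, symbol, j):
    """0-based position of the j-th occurrence (j >= 1) of `symbol`.

    Returns -1 if `symbol` occurs fewer than j times.
    """
    seen = 0
    for pos, s in enumerate(seq):
        if s == symbol:
            seen += 1
            if seen == j:
                return pos
    return -1


text = "abracadabra"
third_a = select(text, "a", 3)
# Select inverts rank: the prefix ending at the j-th occurrence contains j copies.
print(third_a, rank(text, "a", third_a + 1))
```

The last line shows the inverse relationship the abstract mentions: for any symbol that occurs at least j times, `rank(seq, c, select(seq, c, j) + 1) == j`.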
Nature Communications | 2014
Virpi Ahola; Rainer Lehtonen; Panu Somervuo; Leena Salmela; Patrik Koskinen; Pasi Rastas; Niko Välimäki; Lars Paulin; Jouni Kvist; Niklas Wahlberg; Jaakko Tanskanen; Emily A. Hornett; Laura Ferguson; Shiqi Luo; Zijuan Cao; Maaike de Jong; Anne Duplouy; Olli-Pekka Smolander; Heiko Vogel; Rajiv C. McCoy; Kui Qian; Wong Swee Chong; Qin Zhang; Freed Ahmad; Jani K. Haukka; Aruj Joshi; Jarkko Salojärvi; Christopher W. Wheat; Ewald Grosse-Wilde; Daniel C. Hughes
Previous studies have reported that chromosome synteny in Lepidoptera has been well conserved, yet the number of haploid chromosomes varies widely from 5 to 223. Here we report the genome (393 Mb) of the Glanville fritillary butterfly (Melitaea cinxia; Nymphalidae), a widely recognized model species in metapopulation biology and eco-evolutionary research, which has the putative ancestral karyotype of n=31. Using phylogenetic analyses of Nymphalidae and of other Lepidoptera, combined with orthologue-level comparisons of chromosomes, we conclude that the ancestral lepidopteran karyotype has been n=31 for at least 140 My. We show that fusion chromosomes have retained the ancestral chromosome segments and very few rearrangements have occurred across the fusion sites. The same, shortest ancestral chromosomes have independently participated in fusion events in species with smaller karyotypes. The short chromosomes have a higher rearrangement rate than long ones. These characteristics highlight distinctive features of the evolutionary dynamics of butterflies and moths.
String Processing and Information Retrieval | 2004
Paolo Ferragina; Giovanni Manzini; Veli Mäkinen; Gonzalo Navarro
We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T[1,n] is bounded by \(n H_{k}(T) + O\bigl((n \log\log n)/\log_{\vert {\Sigma}\vert } n\bigr)\) bits, where \(H_{k}(T)\) is the k-th order empirical entropy of T.
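For readers unfamiliar with how an FM-index counts pattern occurrences, the following is a minimal sketch of backward search over the BWT. It is not the index described in the abstract: rank is answered by a naive linear scan where the paper would use a wavelet tree, the BWT is built by sorting rotations, and nothing is compressed. The function names and the `$` sentinel are assumptions for this example.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations.

    Assumes `text` ends with a unique sentinel smaller than every other symbol.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(r[-1] for r in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of `pattern` in the original text by backward search."""
    # C[c] = number of symbols in the text that are strictly smaller than c.
    C = {}
    total = 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    lo, hi = 0, len(bwt_str)  # half-open range of matching suffix ranks
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Naive rank; a wavelet tree would answer these two queries instead.
        lo = C[c] + bwt_str[:lo].count(c)
        hi = C[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("abracadabra$")
print(count_occurrences(index, "abra"), count_occurrences(index, "cad"))
```

Each pattern symbol costs two rank queries, which is why the cost of rank on the BWT governs the counting time and why the alphabet size matters for the index.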
Theoretical Computer Science | 2009
Johannes Fischer; Veli Mäkinen; Gonzalo Navarro
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.
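The next/previous smaller value primitive named in the abstract above can be illustrated with a standard stack-based scan over an array. This sketch precomputes all answers offline in linear time and stores them explicitly, which is not the compressed query structure the paper proposes. The function names, the convention of returning -1 or len(values) when no smaller value exists, and the sample array are assumptions for this example.

```python
def previous_smaller_values(values):
    """psv[i] = largest j < i with values[j] < values[i], or -1 if none."""
    psv = [-1] * len(values)
    stack = []  # indices whose values are strictly increasing
    for i, v in enumerate(values):
        while stack and values[stack[-1]] >= v:
            stack.pop()
        psv[i] = stack[-1] if stack else -1
        stack.append(i)
    return psv


def next_smaller_values(values):
    """nsv[i] = smallest j > i with values[j] < values[i], or len(values) if none."""
    n = len(values)
    nsv = [n] * n
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and values[stack[-1]] >= values[i]:
            stack.pop()
        nsv[i] = stack[-1] if stack else n
        stack.append(i)
    return nsv


# Hypothetical array standing in for longest-common-prefix values.
lcp = [0, 1, 4, 1, 1, 0, 3]
print(previous_smaller_values(lcp))
print(next_smaller_values(lcp))
```

On a longest-common-prefix array, the positions returned by these two queries around an index delimit the range over which the prefix length stays at least as large as the value at that index, which is one way such a primitive can stand in for explicit tree topology.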
Combinatorial Pattern Matching | 2007
Niko Välimäki; Veli Mäkinen
We study the Document Listing problem, where a collection D of documents d1,..., dk of total length Σi di = n is to be preprocessed, so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with O(n) time preprocessing, one can answer the queries in O(m + ndoc) time. In this paper, we improve the space requirement of Muthukrishnan's solution from O(n log n) bits to |CSA| + 2n + n log k(1 + o(1)) bits, where |CSA| ≤ n log |Σ|(1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of documents. The time requirement depends on the CSA used, but we can obtain e.g. the optimal O(m + ndoc) time when |Σ|, k = O(polylog(n)). For general |Σ|, k the time requirement becomes O(m log |Σ| + ndoc log k). Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
Bioinformatics | 2011
Leena Salmela; Veli Mäkinen; Niko Välimäki; Johannes Ylinen; Esko Ukkonen
Motivation: Assembling genomes from short read data has become increasingly popular, but the problem remains computationally challenging, especially for larger genomes. We study the scaffolding phase of sequence assembly where preassembled contigs are ordered based on mate pair data. Results: We present MIP Scaffolder that divides the scaffolding problem into smaller subproblems and solves these with mixed integer programming. The scaffolding problem can be represented as a graph and the biconnected components of this graph can be solved independently. We present a technique for restricting the size of these subproblems so that they can be solved accurately with mixed integer programming. We compare MIP Scaffolder to two state-of-the-art methods, SOPRA and SSPACE. MIP Scaffolder is fast and produces scaffolds that are as good as or better than those of its competitors on large genomes. Availability: The source code of MIP Scaffolder is freely available at http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/.
String Processing and Information Retrieval | 2008
Jouni Sirén; Niko Välimäki; Veli Mäkinen; Gonzalo Navarro
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of highly repetitive sequence collections in a space-efficient manner so that retrieval of the content as well as queries on the content of the sequences can be provided time-efficiently. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use run-length encoding and give empirical evidence that these structures are superior to the current structures.
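To see why run-length encoding suits repetitive collections, the sketch below run-length encodes the Burrows-Wheeler transform of a text made of repeated copies. Applying the encoding to the BWT is an assumption here, since the abstract says only that the new structures use run-length encoding; the function names and the toy input are also hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations.

    Assumes `text` ends with a unique sentinel smaller than every other symbol.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(r[-1] for r in rotations)


def run_length_encode(s):
    """Collapse maximal runs of equal symbols into (symbol, length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, k) for c, k in runs]


# Toy "collection": four identical copies of a base sequence plus a sentinel.
collection = "abc" * 4 + "$"
transformed = bwt(collection)
runs = run_length_encode(transformed)
print(transformed, len(runs), "runs for", len(collection), "symbols")
```

In this toy case the number of runs depends on the base sequence rather than on the number of copies, which is the kind of saving an entropy bound based on short contexts does not capture.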