Juha Kärkkäinen | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Juha Kärkkäinen is active.

Explore More

Publication

Featured researches published by Juha Kärkkäinen.

The Computer Journal | 1999

TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies

Yka Huhtala; Juha Kärkkäinen; Pasi Porkka; Hannu Toivonen

The discovery of functional dependencies from relations is an important database analysis technique. We present TANE, an efficient algorithm for finding functional dependencies from large databases. TANE is based on partitioning the set of rows with respect to their attribute values, which makes testing the validity of functional dependencies fast even for a large number of tuples. The use of partitions also makes the discovery of approximate functional dependencies easy and efficient and the erroneous or exceptional rows can be identified easily. Experiments show that T ANE is fast in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.

international colloquium on automata languages and programming | 2003

Simple linear work suffix array construction

Juha Kärkkäinen; Peter Sanders

A suffix array represents the suffixes of a string in sorted order. Being a simpler and more compact alternative to suffix trees, it is an important tool for full text indexing and other string processing tasks. We introduce the skew algorithm for suffix array construction over integer alphabets that can be implemented to run in linear time using integer sorting as its only nontrivial subroutine: 1. recursively sort suffixes beginning at positions i mod 3 ≠ 0. 2. sort the remaining suffixes using the information obtained in step one. 3. merge the two sorted sequences obtained in steps one and two. The algorithm is much simpler than previous linear time algorithms that are all based on the more complicated suffix tree data structure. Since sorting is a well studied problem, we obtain optimal algorithms for several other models of computation, e.g. external memory with parallel disks, cache oblivious, and parallel. The adaptations for BSP and EREW-PRAM are asymptotically faster than the best previously known algorithms.

Journal of the ACM | 2006

Linear work suffix array construction

Juha Kärkkäinen; Peter Sanders; Stefan Burkhardt

Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space--time tradeoff. For any v ∈ [1,&nradic;], it runs in O(vn) time using O(n/&vradic;) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.

international conference on data engineering | 1998

Efficient discovery of functional and approximate dependencies using partitions

Yka Huhtala; Juha Kärkkäinen; Pasi Porkka; Hannu Toivonen

Discovery of functional dependencies from relations has been identified as an important database analysis technique. We present a new approach for finding functional dependencies from large databases, based on partitioning the set of rows with respect to their attribute values. The use of partitions makes the discovery of approximate functional dependencies easy and efficient, and the erroneous or exceptional rows can be identified easily. Experiments show that the new algorithm is efficient in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.

combinatorial pattern matching | 2002

Better filtering with gapped q-grams

Stefan Burkhardt; Juha Kärkkäinen

A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e, approximate string matching with the Hamming distance.

computing and combinatorics conference | 1996

Sparse Suffix Trees

Juha Kärkkäinen; Esko Ukkonen

A sparse suffix tree is a suffix tree that represents only a subset of the suffixes of the text. This is in contrast to the standard suffix tree that represents all suffixes. By selecting a small enough subset, a sparse suffix tree can be made to fit the available storage, unfortunately at the cost of increased search times. The idea of sparse suffix trees goes back to PATRICIA tries. Evenly spaced sparse suffix trees represent every kth suffix of the text. In the paper, we give general construction and search algorithms for evenly spaced sparse suffix trees, and present their run time analysis, both in the worst and in the average case. The algorithms are further improved by using so-called dual suffix trees.

ACM Journal of Experimental Algorithms | 2008

Better external memory suffix array construction

Roman Dementiev; Juha Kärkkäinen; Jens Mehnert; Peter Sanders

Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular, in bioinformatics. However, so far, it has appeared prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on average. Our implementation can construct suffix arrays for inputs of up to 4-GB in hours on a low-cost machine. As a tool of possible independent interest, we present a systematic way to design, analyze, and implement pipelined algorithms.

combinatorial pattern matching | 2009

Permuted Longest-Common-Prefix Array

Juha Kärkkäinen; Giovanni Manzini; Simon J. Puglisi

The longest-common-prefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms including the fastest known LCP construction algorithm and some extremely space efficient algorithms. We also prove a new combinatorial property of the LCP values.

Theoretical Computer Science | 2007

Fast BWT in small space by blockwise suffix sorting

Juha Kärkkäinen

Abstract We present a new space- and time-efficient algorithm for computing the Burrow–Wheeler transform (BWT). For any choice of a parameter v ∈ [ 3 , n 2 / 3 ] , the computation of BWT for a text of length n takes O ( n log n + v n ) worst-case time and O ( n log n + v n ) average-case time using O ( n log n / v ) bits of space in addition to the text and the BWT. For example, if v = log 2 n , the time is O ( n log 2 n ) in the worst case and O ( n log n ) on an average with the additional space requirement of O ( n ) bits. The algorithm is alphabet-independent: it uses only character comparisons, and the complexities do not depend on the alphabet size unless v does. A practical implementation is 2–3 times slower than one of the fastest and most space-efficient previous algorithms while needing only one-third of the main memory. The algorithm is based on suffix arrays, but unlike any other algorithm, it can construct the suffix array a small block at a time without storing the rest of the suffix array anywhere.

combinatorial pattern matching | 1995

Suffix cactus: A cross between suffix tree and suffix array

Juha Kärkkäinen

The suffix cactus is a new alternative to the suffix tree and the suffix array as an index of large static texts. Its size and its performance in searches lies between those of the suffix tree and the suffix array. Structurally, the suffix cactus can be seen either as a compact variation of the suffix tree or as an augmented suffix array.

Explore More