Kunihiko Sadakane | Researchain

Publication

Featured research published by Kunihiko Sadakane.

Bioinformatics | 2015

MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li; Chi-Man Liu; Ruibang Luo; Kunihiko Sadakane; Tak Wah Lam

MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252 Gbps in 44.1 and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing such as partitioning or normalization is needed. Compared with previous methods on the soil data, MEGAHIT generated a threefold larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, a fourfold improvement.
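
As a rough illustration of the graph that such an assembler works on, the following is a minimal sketch of a plain (not succinct) de Bruijn graph built from reads. It is not MEGAHIT's implementation; the function name `build_de_bruijn` and the toy reads are assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to.

    Hypothetical helper for illustration: every k-mer in a read contributes
    one edge from its length-(k-1) prefix to its length-(k-1) suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Toy reads, invented for the example.
    g = build_de_bruijn(["ACGTAC", "CGTACG"], k=3)
    for node in sorted(g):
        print(node, "->", sorted(g[node]))
```

A succinct representation would store the same edges in a few bits per edge instead of Python sets; this sketch only shows what is being represented.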

Theory of Computing Systems / Mathematical Systems Theory | 2007

Compressed Suffix Trees with Full Functionality

Kunihiko Sadakane

We introduce new data structures for compressed suffix trees whose size is linear in the text size. The size is measured in bits; thus they occupy only O(n log |A|) bits for a text of length n on an alphabet A. This is a remarkable improvement over current suffix trees, which require O(n log n) bits. Though some components of suffix trees have been compressed, there is no linear-size data structure for suffix trees with full functionality such as computing suffix links, string-depths and lowest common ancestors. The data structure proposed in this paper is the first one that has linear size and supports all operations efficiently. Any algorithm running on a suffix tree can also be executed on our compressed suffix trees with a slight slowdown of a factor of polylog(n).
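
To make the string-depth and lowest-common-ancestor operations concrete, here is a small uncompressed sketch. It relies on a standard fact: the string depth of the lowest common ancestor of two suffix-tree leaves equals the minimum of the LCP array between their ranks. The helper names are assumptions, and none of this reflects the compressed representation of the paper.

```python
def suffix_array(text):
    """Naive suffix array by sorting suffixes; fine for toy inputs only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[r] = longest common prefix length of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    def common(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [0] + [common(sa[r - 1], sa[r]) for r in range(1, len(sa))]


def lca_string_depth(lcp, r1, r2):
    """String depth of the LCA of the leaves with distinct ranks r1 and r2."""
    lo, hi = sorted((r1, r2))
    return min(lcp[lo + 1:hi + 1])


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    # Ranks 2 and 3 are the suffixes "ana$" and "anana$".
    print(lca_string_depth(lcp, 2, 3))
```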

Journal of Algorithms | 2003

New text indexing functionalities of the compressed suffix arrays

Kunihiko Sadakane

New text indexing functionalities of the compressed suffix arrays are proposed. The compressed suffix array proposed by Grossi and Vitter is a space-efficient data structure for text indexing. It occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies n log_2 |A| bits for the alphabet A. In this paper we modify the data structure so that pattern matching can be done without any access to the text. In addition to the original functions of the compressed suffix array, we add new operations search, decompress and inverse to the compressed suffix arrays. We show that the new index can find occ occurrences of any substring P of the text in O(|P| log n + occ log^ε n) time for any fixed 1 ≥ ε > 0 without access to the text. The index can also decompress a part of the text of length m in O(m + log^ε n) time. For a text of length n on an alphabet A such that |A| = polylog(n), our new index occupies only O(nH_0 + n log log |A|) bits, where H_0 ≤ log |A| is the order-0 entropy of the text. In particular, for ε = 1 the size is nH_0 + O(n log log |A|) bits. Therefore the index will be smaller than the text, which means we can perform fast queries from compressed texts.
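
The idea of reading the text back without storing it can be sketched with the function usually written Ψ, where Ψ[i] is the rank of the suffix that starts one position after the suffix of rank i. The sketch below is uncompressed and the names `build_psi` and `decompress` are assumptions; it also assumes the text ends with a unique `$` that sorts before every other character.

```python
def build_psi(text):
    """Return (psi, first_chars, start_rank) for a text ending in a unique '$'."""
    assert text.endswith("$") and text.count("$") == 1
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    # psi[r]: rank of the suffix starting right after the suffix of rank r.
    psi = [rank[(sa[r] + 1) % n] for r in range(n)]
    # First character of each sorted suffix; sorted, so a real index would
    # keep only the boundaries between character runs.
    first_chars = [text[pos] for pos in sa]
    return psi, first_chars, rank[0]


def decompress(psi, first_chars, start_rank, length):
    """Recover `length` characters, starting at the suffix of rank start_rank."""
    out = []
    r = start_rank
    for _ in range(length):
        out.append(first_chars[r])
        r = psi[r]
    return "".join(out)


if __name__ == "__main__":
    psi, first_chars, start = build_psi("banana$")
    print(decompress(psi, first_chars, start, 7))
```

Here only the rank of text position 0 is kept. Decompressing from an arbitrary position additionally needs the inverse operation, which is where the extra log^ε n term in the stated bound comes from.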

Journal of Discrete Algorithms | 2007

Succinct data structures for flexible text retrieval systems

Kunihiko Sadakane

We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idf scores there have been no efficient data structures for arbitrary patterns. Our new data structures support these queries using small space. The space is only 2/ε times the size of compressed documents plus 10n bits for a document collection of length n, for any 0 < ε ≤ 1. This is much smaller than the previous O(n log n)-bit data structures. Query time is O(m + q log^ε n) for listing and computing tf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in the sense that they support queries for arbitrary patterns.
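
For reference, the quantities being queried can be computed naively by scanning every document. The sketch below does that for an arbitrary pattern; it assumes one common definition, tf = number of occurrences in the document and idf = log(N/df), which may differ in detail from the scoring used in the paper. The function names are hypothetical.

```python
import math


def count_occurrences(doc, pattern):
    """Count possibly overlapping occurrences of a non-empty pattern in doc."""
    count, start = 0, doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def tf_idf_scores(docs, pattern):
    """Return {doc_index: tf * idf} for every document containing pattern."""
    tf = [count_occurrences(d, pattern) for d in docs]
    df = sum(1 for c in tf if c > 0)  # number of documents containing pattern
    if df == 0:
        return {}
    idf = math.log(len(docs) / df)  # assumed idf variant: log(N / df)
    return {i: c * idf for i, c in enumerate(tf) if c > 0}


if __name__ == "__main__":
    # Toy collection, invented for the example.
    print(tf_idf_scores(["abab", "bbb", "aba"], "ab"))
```

This scan takes time proportional to the whole collection for each query, which is the cost the succinct index is designed to avoid.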

International Symposium on Algorithms and Computation | 2000

Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array

Kunihiko Sadakane

A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies O(n log |Σ|) bits for the alphabet Σ. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text in O(|P| log n + occ log^ε n) time and decompress a part of the text of length l in O(l + log^ε n) time for any given 1 ≥ ε > 0. Our data structure occupies only n(2/ε (3/2 + H_0 + 2 log H_0) + 2 + 4 log^ε n/(log^ε n − 1)) + o(n) + O(|Σ| log |Σ|) bits, where H_0 ≤ log |Σ| is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.

ACM Transactions on Algorithms | 2014

Fully Functional Static and Dynamic Succinct Trees

Gonzalo Navarro; Kunihiko Sadakane

We propose new succinct representations of ordinal trees and match various space/time lower bounds. It is known that any n-node static tree can be represented in 2n + o(n) bits so that a number of operations on the tree can be supported in constant time under the word-RAM model. However, the data structures are complicated and difficult to dynamize. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives that are carried out in constant time on polylog-sized trees. The result is extended to trees of arbitrary size, retaining constant time and reaching 2n + O(n/polylog(n)) bits of space. This space is optimal for a core subset of the operations supported and significantly lower than in any previous proposal. For the dynamic case, where insertion/deletion (indels) of nodes is allowed, the existing data structures support a very limited set of operations. Our data structure builds on the range min-max tree to achieve 2n + O(n/log n) bits of space and O(log n) time for all operations supported in the static scenario, plus indels. We also propose an improved data structure using 2n + O(n log log n/log n) bits and improving the time to the optimal O(log n/log log n) for most operations. We extend our support to forests, where whole subtrees can be attached to or detached from others, in time O(log^(1+ε) n) for any ε > 0. Such operations had not been considered before. Our techniques are of independent interest. An immediate derivation yields an improved solution to range minimum/maximum queries where consecutive elements differ by ±1, achieving n + O(n/polylog(n)) bits of space. A second one stores an array of numbers supporting operations sum and search and limited updates, in optimal time O(log n/log log n). A third one allows representing dynamic bitmaps and sequences over alphabets of size σ, supporting rank/select and indels, within zero-order entropy bounds and time O(log n log σ/(log log n)^2) for all operations. This time is the optimal O(log n/log log n) on bitmaps and polylog-sized alphabets. This improves upon the best existing bounds for entropy-bounded storage of dynamic sequences, compressed full-text self-indexes, and compressed-space construction of the Burrows-Wheeler transform.
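
The flavour of the approach can be sketched on a balanced-parentheses encoding, where an n-node ordinal tree becomes a string of 2n parentheses and tree navigation reduces to searches on the running excess (opens minus closes). The class below is a one-level toy with hypothetical names: it stores the excess array explicitly and keeps one minimum per block, whereas the structure described in the abstract keeps only the bits plus a hierarchy of per-block minima and maxima.

```python
def to_parens(tree):
    """Encode an ordinal tree, given as nested lists of children, as parentheses."""
    return "(" + "".join(to_parens(child) for child in tree) + ")"


class BPTree:
    """Toy balanced-parentheses tree with per-block excess minima."""

    def __init__(self, bp, block=4):
        self.bp = bp
        self.block = block
        self.excess = []
        e = 0
        for c in bp:
            e += 1 if c == "(" else -1
            self.excess.append(e)
        # Minimum excess inside each block; lets find_close skip whole blocks.
        self.block_min = [min(self.excess[s:s + block])
                          for s in range(0, len(bp), block)]

    def find_close(self, i):
        """Index of the ')' matching the '(' at index i."""
        target = self.excess[i] - 1
        j = i + 1
        while j < len(self.bp):
            if j % self.block == 0 and self.block_min[j // self.block] > target:
                j += self.block  # the match cannot lie in this block
                continue
            if self.excess[j] == target:
                return j
            j += 1
        raise ValueError("parentheses are not balanced")

    def subtree_size(self, i):
        """Number of nodes in the subtree whose '(' is at index i."""
        return (self.find_close(i) - i + 1) // 2


if __name__ == "__main__":
    bp = to_parens([[[], []], []])  # root with an internal child and a leaf
    t = BPTree(bp)
    print(bp, t.find_close(0), t.subtree_size(1))
```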

Foundations of Computer Science | 2003

Breaking a time-and-space barrier in constructing full-text indices

Wing-Kai Hon; Kunihiko Sadakane; Wing-Kin Sung

Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has been open for a long time whether these indices can be constructed in both o(n log n) time and o(n log n)-bit working space, where n denotes the length of the text. In the literature, the fastest algorithm runs in O(n) time, while it requires O(n log n)-bit working space. On the other hand, the most space-efficient algorithm requires O(n)-bit working space while it runs in O(n log n) time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n log^ε n) time and O(n)-bit working space for any 0 < ε < 1. Apart from that, our algorithm can also be adapted to build other existing full-text indices, such as Compressed Suffix Tree, Compressed Suffix Arrays and FM-index. We also study the general case where the size of the alphabet A is not constant. Our algorithm can construct a suffix array and a suffix tree using optimal O(n log |A|)-bit working space while running in O(n log log |A|) time and O(n log^ε n) time, respectively. These are the first algorithms that achieve o(n log n) time with optimal working space, under a reasonable assumption that log |A| = o(log n).
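
For readers unfamiliar with the index being constructed, the following sketch builds a suffix array the naive way (sorting all suffixes, far from the bounds discussed above) and uses it for pattern matching by binary search. The function names are assumptions for illustration.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via two binary searches over sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa, find_occurrences(text, sa, "ana"))
```

The array holds n integers of about log n bits each, which is the O(n log n)-bit cost that space-efficient construction algorithms aim to avoid in working memory.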

Symposium on Discrete Algorithms | 2006

Squeezing succinct data structures into entropy bounds

Kunihiko Sadakane; Roberto Grossi

Consider a sequence S of n symbols drawn from an alphabet A = {1, 2, ..., σ}, stored as a binary string of n log σ bits. A succinct data structure on S supports a given set of primitive operations on S using just f(n) = o(n log σ) extra bits. We present a technique for transforming succinct data structures (which do not change the binary content of S) into compressed data structures using nH_k + f(n) + O(n(log σ + log log_σ n + k)/log_σ n) bits of space, where H_k ≤ log σ is the kth-order empirical entropy of S. When k + log σ = o(log n), we improve the space complexity of the succinct data structure from n log σ + o(n log σ) to nH_k + o(n log σ) bits by keeping S in compressed format, so that any substring of O(log_σ n) symbols in S (i.e. O(log n) bits) can be decoded on the fly in constant time. Thus, the time complexity of the supported operations does not change asymptotically. Namely, if an operation takes t(n) time in the succinct data structure, it requires O(t(n)) time in the resulting compressed data structure. Using this simple approach we improve the space complexity of some of the best known results on succinct data structures. We extend our results to handle another definition of entropy.
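
The quantity H_k in these bounds can be computed directly from its usual definition: group the symbols of S by the k symbols that precede them, take the order-0 entropy of each group, and average over n. The sketch below follows that definition; the function name is an assumption and the first k symbols, which have no full context, are simply not counted.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy(s, k=0):
    """k-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    # followers[context] counts the symbols that appear right after context.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[s[i - k:i]][s[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        total_bits -= sum(c * math.log2(c / m) for c in counts.values())
    return total_bits / n


if __name__ == "__main__":
    s = "abababab"
    print(empirical_entropy(s, 0), empirical_entropy(s, 1))
```

For a strictly alternating string such as the one above, each symbol is fully determined by its predecessor, so the order-1 value drops to zero while the order-0 value stays at one bit per symbol.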

Computing and Combinatorics Conference | 2002

A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays

Wing-Kai Hon; Tak Wah Lam; Kunihiko Sadakane; Wing-Kin Sung; Siu-Ming Yiu

With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research has been centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in the main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing compressed suffix arrays is still not an easy task because we still have to compute suffix arrays first and need a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only O(n) bits of working memory, and the time complexity is O(n log n). Our construction algorithm is also time and space efficient for texts with large alphabets such as Chinese or Japanese. Precisely, when the alphabet size is |Σ|, the working space is O(n log |Σ|) bits, and the time complexity remains O(n log n), which is independent of |Σ|.

ACM Transactions on Algorithms | 2007

Compressed indexes for dynamic text collections

Ho-Leung Chan; Wing-Kai Hon; Tak Wah Lam; Kunihiko Sadakane

Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically. This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem, where we show an index of O(n) bits for a text collection L of total length n, which can be updated in O(|T| log n) time when a text T is inserted or deleted from L; also, the index supports searching the occurrences of any pattern P in all texts in L in O(|P| log n + occ log^2 n) time, where occ is the number of occurrences. Our second result is a compressed solution to the dictionary matching problem, where we show an index of O(d) bits for a pattern collection D of total length d, which can be updated in O(|P| log^2 d) time when a pattern P is inserted or deleted from D; also, the index supports searching the occurrences of all patterns of D in any text T in O((|T| + occ) log^2 d) time. When compared with the O(d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compact solution increases the query time by roughly a factor of log d only. The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Precisely, we give an O(n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(|T| log^2 n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In the study of the aforementioned result, we also derive the first O(n)-bit representation for maintaining n pairs of balanced parentheses in O(log n/log log n) time per operation, matching the time complexity of the previous O(n log n)-bit solution.
